Cortex-A9 Processor Microarchitecture


Cortex A9 Technical Lead
October 2007

Next Generation Microarchitecture
 Requirements
   Support delivery of both single core, and MPCore implementations
      Suitable for a broad range of device classes

   Cortex A-class application processor compatibility
      Support NEON advanced SIMD engine
      Maintain Jazelle DBX and FPU only options

   50-60% per cycle performance uplift from ARM1176
      Plus significant Floating-Point uplift

   Maintain ease of synthesis implementation
      Leaving design MHz similar to ARM11 devices

Baseline Comparison
 ARM1176F
    8 stage pipeline, single issue, in-order
    500MHz in TSMC90G, Advantage (9 track), 32K caches
        1.2 DMIPS/MHz
        0.34 mW/DMIP
        200 DMIPS / mm2
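
The baseline figures above imply a few derived numbers that make later comparisons easier. The sketch below only multiplies out the values quoted on the slides (500MHz, 1.2 DMIPS/MHz, 0.34 mW/DMIP, 200 DMIPS/mm2) and applies the 50-60% per cycle uplift requirement; the function names are illustrative, and the derived power and area are implied products, not figures stated in the deck.

```python
# Figures quoted on the slides for the ARM1176 baseline.
FREQ_MHZ = 500
DMIPS_PER_MHZ = 1.2
MW_PER_DMIPS = 0.34
DMIPS_PER_MM2 = 200


def baseline_figures():
    """Return (DMIPS, power in mW, area in mm2) implied by the quoted ratios."""
    dmips = FREQ_MHZ * DMIPS_PER_MHZ
    power_mw = dmips * MW_PER_DMIPS
    area_mm2 = dmips / DMIPS_PER_MM2
    return dmips, power_mw, area_mm2


def uplift_target(uplift_fraction):
    """Per cycle performance (DMIPS/MHz) after a fractional uplift."""
    return DMIPS_PER_MHZ * (1 + uplift_fraction)


if __name__ == "__main__":
    print(baseline_figures())
    print(uplift_target(0.5), uplift_target(0.6))
```

Under these numbers the baseline works out to about 600 DMIPS, roughly 204 mW and about 3 mm2, and the uplift requirement corresponds to roughly 1.8 to 1.92 DMIPS/MHz.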

 There is a common front end pipeline and 3 separate in-order back end pipelines:
      Common front end: Fetch 1, Fetch 2, Decode, Issue
      ALU pipeline: Shift, ALU, Saturate, Write back
      Multiply pipeline: MAC 1, MAC 2, MAC 3
      Load/store pipeline: Address, Data Cache 1, Data Cache 2, Write Back

Design Considerations
 Look at the system cost, not just the amount of processor logic
    Weigh the cost of full symmetric dual-issue vs. asymmetric
    Instruction occurrences, load/store vs. computation

 Better support signal processing and media routines
    Reduce the load-use penalty
    Accelerate ‘tight loops’

 Importance of interface to memory
      Streaming of cached and uncached data
      Outstanding transactions
      Hiding memory latencies
      Support for coherence with accelerators and DMA


Microarchitecture Overview
 Variable length, out of order, 8 stage superscalar pipeline
    Advanced prefetch with parallel branch pipeline enabling early branch
       prediction and resolution
      Multi-issued into:
          Primary data processing pipeline
          Secondary full data processing pipeline
          Load-store pipeline
          Compute engine (FPU/NEON) pipeline
 Speculative execution
    Supporting virtual renaming of ARM physical registers removing
       pipeline stalls due to data dependencies
          Removing dependency on compiler instruction scheduling
          Increasing processor utilization and hiding of memory latencies
          Increased performance by hardware unrolling of code loops
      Reducing interrupt latency by speculative entry to ISR

 Cortex A9 Microarchitecture (single core variant)
   [Block diagram: Cortex A9 single core processor. Recoverable block labels:]
      Prefetch stage: Instruction cache, Fast-loop look-aside, Branch Prediction (Global History Buffer, BR-Target Addr Cache, Return Stack), Instruction queue, Prediction queue
      Dual-instruction Decode stage
      Register Rename stage: Virtual and physical register pool, Branch Monitor
      Out of order multi-issue with speculation: Instruction queue, Dispatcher, 3+1 Dispatch stage
      Execution pipelines: ALU/MUL, ALU, FPU/NEON, Address
      OoO Write back stage
      Load-Store Unit (quad-slot with forwarding): Store Buffer, Auto-prefetcher, µTLB, MMU, Data Cache
      Bus Interface Unit (BIU): L2 Cache Control, Master interface, Secondary master (with filtering), AMBA 3 AXI 64bit
      External blocks: PL310 L2 Cache Controller, PL390 Interrupt Controller (IRQ/FIQ), ECC RAMs
      Debug and trace: CoreSight Debug Access Port (Coresight / JTAG Debug), Profiling Monitor Block, Program Trace Unit (Coresight Trace)

Cortex A9 Pipeline Structure
   [Figure: pipeline diagram. Clocks in stage, in pipeline order: 1, 1, 1, 2, 1, 1, 1 to 3, 1. One stage is labelled Br.]

 Dispatching up to 4 instructions per clock cycle
       With out of order dispatch from the issue queue to ensure maximum
        utilization of subsequent pipeline stages
       With up to 7 instructions completing per cycle (including FPU/NEON)
 Compute engine (FPU/NEON) executing in parallel
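
As a quick check on the "clocks in stage" figures for this slide, the sketch below sums them to get the shortest and longest path through the pipeline. It assumes the values 1, 1, 1, 2, 1, 1, 1 to 3, 1 read off the figure are in pipeline order and that each applies to one stage; the names are illustrative.

```python
# Clocks per stage as read from the slide's figure, in pipeline order.
# Each entry is (min_clocks, max_clocks); only one stage is variable.
STAGE_CLOCKS = [(1, 1), (1, 1), (1, 1), (2, 2), (1, 1), (1, 1), (1, 3), (1, 1)]


def pipeline_clocks(stages):
    """Return (number of stages, shortest path, longest path) in clocks."""
    shortest = sum(lo for lo, _ in stages)
    longest = sum(hi for _, hi in stages)
    return len(stages), shortest, longest


if __name__ == "__main__":
    print(pipeline_clocks(STAGE_CLOCKS))  # (8, 9, 11)
```

With these inputs the 8 stages take between 9 and 11 clocks, which matches the description of a variable length pipeline.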

Branch Prediction Scheme
 Branch prediction using a Branch Target Address Cache (BTAC) and Global History Buffer (GHB)
      Early in the pipeline to reduce the branch mispredict penalty
      Benchmarked to confirm final size
          BTAC is 256 entries x 2 ways to 1024 entries x 2 ways
          GHB between 1K to 4K entries
      Return stack of 4 x 32 bits

 Optimization of small loops
      Removal of branch penalty
      No RAM power consumption

 Branch folding is done in Integer Core

   [Figure: prefetch stages PF0 to PF3 with Slot0 to Slot3, showing the GHB, BTAC, Return stack and Instruction Cache feeding the Prediction queue and Instruction queue]

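
To illustrate how a global history buffer and a branch target cache can work together, here is a minimal sketch of a generic gshare-style predictor. It is not ARM's implementation: the 2-bit counters, the XOR indexing, the default sizes, and the use of a plain dictionary in place of a 2-way set-associative BTAC are all assumptions made for illustration.

```python
class BranchPredictor:
    """Generic global-history predictor with a target cache (illustrative only)."""

    def __init__(self, ghb_entries=1024, history_bits=10):
        # ghb_entries must be a power of two so the mask below works.
        self.ghb = [1] * ghb_entries          # 2-bit counters, weakly not-taken
        self.mask = ghb_entries - 1
        self.history = 0                      # recent taken/not-taken outcomes
        self.history_mask = (1 << history_bits) - 1
        self.btac = {}                        # branch address -> target address

    def _index(self, pc):
        # Combine the branch address with global history to pick a counter.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        """Return the predicted target, or None for predicted not-taken."""
        taken = self.ghb[self._index(pc)] >= 2
        target = self.btac.get(pc)
        return target if taken and target is not None else None

    def update(self, pc, taken, target=None):
        """Train on the resolved outcome of the branch at pc."""
        i = self._index(pc)
        if taken:
            self.ghb[i] = min(3, self.ghb[i] + 1)
            self.btac[pc] = target
        else:
            self.ghb[i] = max(0, self.ghb[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


if __name__ == "__main__":
    bp = BranchPredictor()
    for _ in range(20):                       # a loop branch that is always taken
        bp.update(0x100, True, 0x80)
    print(hex(bp.predict(0x100)))
```

After enough iterations of an always-taken loop branch, the history settles and the predictor returns the cached target, which is the behaviour that lets prefetch redirect early rather than waiting for branch resolution.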
Register Renaming
 Removes hazards in the pipeline:
       output dependencies (WAW)
       anti-dependencies (WAR)

 Simplifies stalling logic, as
  instructions issued to two different
  pipelines can complete
  irrespective of program order

 Avoids the need for stalling logic
  for branch resolution and
  load/store exception resolution

 Register renaming can fully unroll
  in hardware some simple code
  loops
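
The effect of renaming on WAW and WAR hazards can be shown with a small sketch. It assumes a simplified model in which every destination write is given a fresh physical register from an unbounded pool; the function name, the tuple encoding of instructions, and the register counts are illustrative and say nothing about the actual size of the Cortex A9 register pool.

```python
from itertools import count


def rename(instructions, num_arch_regs=16):
    """Rename (dest, src1, src2) triples of architectural register numbers."""
    # Initially architectural register i lives in physical register i.
    mapping = {r: r for r in range(num_arch_regs)}
    fresh = count(num_arch_regs)      # unbounded pool, a simplification
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, preserving true (RAW) dependencies.
        p1, p2 = mapping[src1], mapping[src2]
        # Every write gets a fresh physical register, so WAW and WAR vanish.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], p1, p2))
    return renamed


if __name__ == "__main__":
    program = [
        (1, 2, 3),   # r1 = r2 op r3
        (2, 4, 5),   # r2 = r4 op r5  (WAR on r2 with the first instruction)
        (1, 6, 7),   # r1 = r6 op r7  (WAW on r1 with the first instruction)
        (8, 1, 2),   # r8 = r1 op r2  (RAW on the latest r1 and r2)
    ]
    for line in rename(program):
        print(line)
```

In the renamed stream the three writers use distinct physical registers, so the first three instructions can complete in any order, while the last instruction still reads the newest values of r1 and r2.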


Decode and Issue Logic
 Dual symmetric decoders for ARM
  and Thumb2
       Supports ARM, Thumb2, Jazelle
        RCT and Jazelle DBX instruction
        sets
       Single decoder for Jazelle DBX




 Issue fed two instructions per cycle with dispatching up to 4 instructions
       Out of order selection of instructions
       Less sensitive to data latencies (e.g. L1 cache misses)

 Branches resolved separately from instruction issue

   [Figure labels: Branch Resolution, VFP/NEON, LS]

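
Out of order selection from the issue queue can be sketched as picking the oldest instructions whose operands are ready, subject to the pipelines available that cycle. The model below is an assumption for illustration: it treats the four dispatch targets as two data processing slots, one load-store slot and one compute engine slot, and the function name, pipe labels and register names are hypothetical.

```python
def select_for_dispatch(queue, ready_regs, slots=None):
    """Pick ready instructions oldest-first, limited by free pipeline slots.

    queue: list of (name, pipe, sources) tuples in program order.
    ready_regs: set of register names whose values are available.
    slots: free slots per pipe; defaults to 2 dp, 1 ls, 1 ce (an assumption).
    """
    slots = dict(slots or {"dp": 2, "ls": 1, "ce": 1})
    chosen = []
    for name, pipe, sources in queue:
        if slots.get(pipe, 0) > 0 and all(s in ready_regs for s in sources):
            slots[pipe] -= 1
            chosen.append(name)
    return chosen


if __name__ == "__main__":
    queue = [
        ("ldr", "ls", ["r0"]),
        ("add", "dp", ["r1"]),        # r1 is still waiting on the load
        ("sub", "dp", ["r2", "r3"]),
        ("mul", "dp", ["r4"]),
        ("orr", "dp", ["r5"]),        # ready, but both dp slots are taken
        ("vadd", "ce", ["d0"]),
    ]
    ready = {"r0", "r2", "r3", "r4", "r5", "d0"}
    print(select_for_dispatch(queue, ready))  # ['ldr', 'sub', 'mul', 'vadd']
```

The stalled add does not block the younger sub and mul, which is the sense in which out of order selection makes the core less sensitive to data latencies.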
Data Processing Pipeline
 Dual symmetric arithmetic execution
  units
      Full forwarding path support
      Executes speculative instructions
      Includes additional 3-stage multiplier
       pipeline


 Variable length
      From 1 to 3 cycle execution time
      Most operations complete in 1 cycle


 The number of DP/LS/MUL pipelines
  carefully balanced to optimize
  performance against silicon cost with
  power efficiency targets

Integration of Compute Engines
 Cortex A9 supports multiple compute engines
      VFPv3 Floating-Point Unit
      NEON Advanced SIMD (includes FP Unit)

 Early in the pipeline to reduce compare-to-branch cost

 Single issue dispatch

 Supports speculation over branches and condition codes

 Issued in parallel with core pipelines

   [Figure: Floating-Point Unit integration]

Data Side Summary
 PIPT data cache
      No need for page colouring

 Load store unit
      Supports up to four simultaneous requests
      Speculative load/store support
      Internal data forwarding between dependent instructions

 Merging store buffer with forwarding
      Reduces power consumption
      Reduces bus transactions

 Bus Interface Unit
      Supports up to four outstanding line fill requests
      Adaptive R/W cache allocation
      Automatic data prefetcher

 Supports cache coherency

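
A merging store buffer with forwarding can be sketched as follows. This is a simplified byte-granular model, not the Cortex A9 design: the 32-byte line size, the class and method names, and the policy of draining everything at once are assumptions chosen to show why merging reduces bus transactions and how a later load can be served from the buffer.

```python
class MergingStoreBuffer:
    """Illustrative store buffer that merges stores per line and forwards loads."""

    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes
        self.lines = {}          # line base address -> {byte offset: value}
        self.bus_writes = 0      # number of bus transactions issued so far

    def store(self, addr, value):
        """Buffer one byte; stores to the same line merge into one entry."""
        base = addr - addr % self.line_bytes
        self.lines.setdefault(base, {})[addr - base] = value

    def load(self, addr):
        """Forward a buffered byte if present; None means read the cache."""
        base = addr - addr % self.line_bytes
        return self.lines.get(base, {}).get(addr - base)

    def drain(self):
        """Write each buffered line out as a single bus transaction."""
        drained = self.lines
        self.bus_writes += len(drained)
        self.lines = {}
        return drained


if __name__ == "__main__":
    sb = MergingStoreBuffer()
    for offset, value in enumerate([0x11, 0x22, 0x33, 0x44]):
        sb.store(0x1000 + offset, value)     # four stores, one line
    sb.store(0x2000, 0x55)                   # a store to a second line
    print(hex(sb.load(0x1001)))              # forwarded from the buffer
    sb.drain()
    print(sb.bus_writes)                     # five stores, two transactions
```

Five stores leave the buffer as two line writes, and the load of a recently stored byte never reaches the cache.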
 Cortex A9 Microarchitecture (single core variant)
   [Block diagram repeated from the earlier slide of the same title]

Project Design Improvements
 Baseline, ARM1176F
    500MHz in TSMC90G, Advantage (9 track), 32K caches
          1.2 DMIPS/MHz
          0.34 mW/DMIP
          200 DMIPS / mm2
      65nm and 45nm process further enhancing device capabilities
          Capable of solution over 1GHz

 Results taken from current trials on pre-release Cortex A9 data

   [Chart: Factor Improvement (axis ticks 0.5, 1.0, 1.5, 2.0) relative to ARM1176 for Dhrystone, Power Efficiency, and Performance Cost (DMIPS/mm2); bar values not recoverable]

Summary
 Cortex A9 is the 2008 Cortex premium application processor
      Provides a migration path for ARM9/ARM11 designs to ARMv7 with increased
       processor efficiency and additional performance headroom
          Brings together the architectural features of TrustZone and Thumb2 and
           multicore processing
      Offering significant peak performance through MPCore technology
          Capable of over 8000 aggregate DMIPS using 65nm foundry process
          While offering increased performance within a mobile 200mW budget

 Supporting options for optimized FPU and NEON compute-engines to
  increase application specific performance

 Utilizes traditional EDA synthesis design tools for rapid
  deployment of low-power and scalable performance processor designs


