Cortex-A9 Processor Microarchitecture
Cortex A9 Technical Lead
October 2007
Next Generation Microarchitecture
Requirements
Support delivery of both single core, and MPCore implementations
Suitable for a broad range of device classes
Cortex A-class application processor compatibility
Support NEON advanced SIMD engine
Maintain Jazelle DBX and FPU only options
50-60% per cycle performance uplift from ARM1176
Plus significant Floating-Point uplift
Maintain ease of synthesis implementation
Leaving design MHz similar to ARM11 devices
Baseline Comparison
ARM1176F
8 stage pipeline, single issue, in-order
500MHz in TSMC90G, Advantage (9 track), 32K caches
1.2 DMIPS/MHz
0.34 mW/DMIP
200 DMIPS / mm2
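As a rough worked example, the baseline figures above can be combined into absolute numbers. This is only a sketch: it assumes the per-DMIPS power and area figures apply at the quoted 500MHz operating point, and the 1.5 and 1.6 factors are simply the 50-60% per cycle uplift stated in the requirements. The variable names are illustrative.

```python
# Baseline ARM1176 figures quoted on the slides.
FREQ_MHZ = 500          # 500MHz in TSMC90G
DMIPS_PER_MHZ = 1.2
MW_PER_DMIPS = 0.34
DMIPS_PER_MM2 = 200

dmips = FREQ_MHZ * DMIPS_PER_MHZ      # Dhrystone throughput at 500MHz
power_mw = dmips * MW_PER_DMIPS       # implied power (assumes the figure holds at 500MHz)
area_mm2 = dmips / DMIPS_PER_MM2      # implied silicon area (same assumption)

# Per cycle target implied by the 50-60% uplift requirement.
target_low = DMIPS_PER_MHZ * 1.5
target_high = DMIPS_PER_MHZ * 1.6

print(f"{dmips:.0f} DMIPS, {power_mw:.0f} mW, {area_mm2:.1f} mm2")
print(f"target: {target_low:.2f} to {target_high:.2f} DMIPS/MHz")
```

Under these assumptions the baseline works out to about 600 DMIPS in roughly 204 mW and 3 mm2, and the uplift target to about 1.8 to 1.92 DMIPS/MHz.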
There is a common front end pipeline and 3 separate in-order back end pipelines:
Front end: Fetch 1, Fetch 2, Decode, Issue
ALU pipeline: Shift, ALU, Saturate, Write back
Multiply pipeline: MAC 1, MAC 2, MAC 3
Load/store pipeline: Address, Data Cache 1, Data Cache 2, Write Back
Design Considerations
Look at the system cost, not just the amount of processor logic
Weigh the cost of full symmetric dual-issue vs. asymmetric
Instruction occurrences, load/store vs. computation
Better support signal processing and media routines
Reduce the load-use penalty
Accelerate ‘tight loops’
Importance of interface to memory
Streaming of cached and uncached data
Outstanding transactions
Hiding memory latencies
Support for coherence with accelerators and DMA
Microarchitecture Overview
Variable length, out of order, 8 stage superscalar pipeline
Advanced prefetch with parallel branch pipeline enabling early branch
prediction and resolution
Multi-issued into:
Primary data processing pipeline
Secondary full data processing pipeline
Load-store pipeline
Compute engine (FPU/NEON) pipeline
Speculative execution
Supporting virtual renaming of ARM physical registers removing
pipeline stalls due to data dependencies
Removing dependency on compiler instruction scheduling
Increasing processor utilization and hiding of memory latencies
Increased performance by hardware unrolling of code loops
Reducing interrupt latency by speculative entry to ISR
Cortex A9 Microarchitecture (single core variant)
[Block diagram: Cortex A9 single core processor. Prefetch stage (instruction cache, fast-loop look-aside, branch prediction with Global History Buffer, BR-Target Addr Cache and Return Stack, instruction queue, prediction queue); dual-instruction decode stage; register rename stage (virtual and physical register pool); 3+1 dispatch stage with out of order, multi-issue, speculative dispatcher; ALU/MUL, ALU, FPU/NEON and address execution; OoO write back stage; Load-Store Unit (µTLB, MMU, quad-slot store buffer with forwarding, data cache, auto-prefetcher); Bus Interface Unit (BIU) with master interface and secondary master (with filtering), AMBA 3 AXI 64bit; PL310 L2 Cache Controller, ECC, L2 Cache RAMs; PL390 Interrupt Controller (IRQ/FIQ); CoreSight Debug Access Port, JTAG, Debug, Profiling Monitor Block, Program Trace Unit.]
Cortex A9 Pipeline Structure
[Figure: pipeline stages annotated with clocks in stage: 1, 1, 1, 2, 1, 1, 1 to 3, 1]
Dispatching up to 4 instructions per clock cycle
With out of order dispatch from the issue queue to ensure maximum
utilization of subsequent pipeline stages
With up to 7 instructions completing per cycle (including FPU/NEON)
Compute engine (FPU/NEON) executing in parallel
Branch Prediction Scheme
Branch prediction using a Branch Target Address Cache (BTAC) and Global History Buffer (GHB)
Early in the pipeline to reduce the branch mispredict penalty.
Benchmarked to confirm final size
BTAC is 256 entries x 2 ways to 1024 entries x 2 ways
GHB between 1K to 4K entries
Return stack of 4 x 32 bits
Optimization of small loops
Removal of branch penalty
No RAM power consumption
Branch folding is done in Integer Core
[Figure: GHB, BTAC and return stack alongside prefetch stages PF0 to PF3, the prediction queue, the instruction cache, and the instruction queue with Slot0 to Slot3]
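To illustrate why a global history buffer helps with loop-style branches, here is a minimal sketch of a generic global-history predictor with 2-bit saturating counters. It is not the Cortex A9 design: the indexing scheme (PC XORed with recent outcomes), the 4-bit history length, and all class and function names are assumptions made for illustration. Only the 1K entry count is taken from the range quoted on the slide.

```python
class GlobalHistoryPredictor:
    """Toy global-history predictor with 2-bit saturating counters."""

    def __init__(self, entries=1024, history_bits=4):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                # recent outcomes, newest in bit 0
        self.counters = [1] * entries   # 0..1 predict not taken, 2..3 predict taken

    def _index(self, pc):
        # Assumed indexing: word-aligned PC XORed with the global history.
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def mispredicts(predictor, pc, outcomes):
    """Run a trace of branch outcomes and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong


# A loop branch that is taken three times and then falls through, repeatedly.
pattern = [True, True, True, False]
ghb = GlobalHistoryPredictor()
warmup = mispredicts(ghb, 0x1000, pattern * 10)   # training phase
steady = mispredicts(ghb, 0x1000, pattern * 10)   # after training
print(warmup, steady)
```

Once trained, each distinct history value selects its own counter, so the loop exit is predicted as reliably as the taken iterations. A single counter per branch would keep mispredicting the exit.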
Register Renaming
Removes hazards in the pipeline:
name-dependencies (WAW)
anti-dependencies (WAR)
Simplifies stalling logic, as
instructions issued to two different
pipelines can complete
irrespective of program order
Avoids the need for stalling logic
for branch resolution and
load/store exception resolution
Register renaming can fully unroll
in hardware some simple code
loops
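The effect of renaming on WAW and WAR hazards can be sketched with a small model. This is an illustrative simplification, not the Cortex A9 rename logic: it assumes an unlimited pool of physical registers, and the `rename` and `hazards` helpers and the `p0`, `p1`, ... names are invented for the example.

```python
from itertools import count


def rename(program, arch_regs):
    """Give every destination a fresh physical register.

    program is a list of (dest, sources) using architectural names.
    """
    fresh = count()
    # Each architectural register starts mapped to its own physical register.
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}
    renamed = []
    for dest, sources in program:
        srcs = [mapping[s] for s in sources]   # read the current mappings first
        mapping[dest] = f"p{next(fresh)}"      # then allocate a new destination
        renamed.append((mapping[dest], srcs))
    return renamed


def hazards(program):
    """Count RAW, WAR and WAW dependencies in program order."""
    counts = {"RAW": 0, "WAR": 0, "WAW": 0}
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> readers since its most recent write
    for i, (dest, sources) in enumerate(program):
        for s in sources:
            if s in last_writer:
                counts["RAW"] += 1
            readers.setdefault(s, []).append(i)
        if dest in last_writer:
            counts["WAW"] += 1
        counts["WAR"] += len([r for r in readers.get(dest, []) if r != i])
        last_writer[dest] = i
        readers[dest] = []
    return counts


# Two iterations of a simple loop body that reuse r1 and r2.
loop = [
    ("r1", ["r0"]),         # r1 = load [r0]
    ("r2", ["r1", "r3"]),   # r2 = r1 + r3
    ("r1", ["r4"]),         # r1 = load [r4]
    ("r2", ["r1", "r3"]),   # r2 = r1 + r3
]
before = hazards(loop)
after = hazards(rename(loop, ["r0", "r1", "r2", "r3", "r4"]))
print(before, after)
```

In this example the true (RAW) dependencies are unchanged, while the WAW and WAR hazards caused by reusing r1 and r2 disappear, so the second iteration no longer has to wait for the first. That is the sense in which renaming can unroll a simple loop in hardware.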
Decode and Issue Logic
Dual symmetric decoders for ARM and Thumb2
Supports ARM, Thumb2, Jazelle RCT and Jazelle DBX instruction sets
Single decoder for Jazelle DBX
Issue fed two instructions per cycle with dispatching up to 4 instructions
Out of order selection of instructions
Less sensitive to data latencies (e.g. L1 cache misses)
Branches resolved separately from instruction issue
[Figure: issue logic feeding Branch Resolution, VFP/NEON and LS]
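The benefit of out of order selection when a load misses in the L1 cache can be shown with a small sketch. It is a simplified model under stated assumptions: registers are already renamed, readiness is tracked as a plain set of register names, and the function names and instruction tuples are hypothetical. The width of 4 follows the "up to 4 instructions" figure above.

```python
def select_ready(queue, ready_regs, width=4):
    """Pick up to `width` instructions whose sources are all available.

    queue is ordered oldest first; each entry is (name, dest, sources).
    Returns (issued, still_waiting).
    """
    issued, waiting = [], []
    for ins in queue:
        sources = ins[2]
        if len(issued) < width and all(s in ready_regs for s in sources):
            issued.append(ins)
        else:
            waiting.append(ins)
    return issued, waiting


def select_in_order(queue, ready_regs, width=4):
    """In-order baseline: stop at the first instruction that is not ready."""
    issued = []
    for ins in queue:
        if len(issued) == width or not all(s in ready_regs for s in ins[2]):
            break
        issued.append(ins)
    return issued, queue[len(issued):]


# r1 is the destination of a load that missed in the L1 cache, so it is not ready.
ready = {"r3", "r5", "r6"}
queue = [
    ("add", "r2", ["r1", "r3"]),   # waits for the load
    ("sub", "r4", ["r5", "r6"]),   # independent
    ("mov", "r7", ["r5"]),         # independent
    ("mul", "r8", ["r2", "r4"]),   # waits for add and sub
]

ooo_issued, ooo_waiting = select_ready(queue, ready)
ino_issued, _ = select_in_order(queue, ready)
print([i[0] for i in ooo_issued], [i[0] for i in ino_issued])
```

Here the out of order selection issues `sub` and `mov` while `add` waits for the load, whereas the in-order baseline issues nothing in that cycle.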
Data Processing Pipeline
Dual symmetric arithmetic execution
units
Full forwarding path support
Executes speculative instructions
Includes additional 3-stage multiplier
pipeline
Variable length
From 1 to 3 cycle execution time
Most operations complete in 1 cycle
The number of DP/LS/MUL pipelines
carefully balanced to optimize
performance against silicon cost with
power efficiency targets
Integration of Compute Engines
Cortex A9 supports multiple
compute engines
VFPv3 Floating-Point Unit
NEON Advanced SIMD
(includes FP Unit)
Early in the pipeline to reduce
compare-to-branch cost
Single issue dispatch
Supports speculation over
branches and condition codes
Floating-Point Unit integration
Issued in parallel with core
pipelines
Data Side Summary
PIPT data cache
No need for page colouring
Load store unit
Supports up to four simultaneous
requests
Speculative load/stores support
Internal data forwarding between
dependent instructions
Merging store buffer with forwarding
Reduces power consumption
Reduces bus transactions
Bus Interface Unit
Supports up to four outstanding line
fill requests
Adaptive R/W cache allocation
Automatic data prefetcher
Supports cache coherency
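The way a merging store buffer with forwarding reduces bus transactions can be sketched as follows. This is a toy model, not the Cortex A9 store buffer: the 32-byte line size, byte-granular entries, unlimited capacity, and the class and method names are all assumptions made for illustration.

```python
LINE_BYTES = 32  # assumed line size, for illustration only


class MergingStoreBuffer:
    """Toy store buffer that merges stores by line and forwards to loads."""

    def __init__(self):
        self.lines = {}   # line base address -> {byte offset: value}

    def store(self, addr, value):
        base, offset = addr - addr % LINE_BYTES, addr % LINE_BYTES
        # Stores to the same line share one entry instead of one entry each.
        self.lines.setdefault(base, {})[offset] = value

    def load(self, addr):
        """Return a buffered byte if present (forwarding), else None."""
        base, offset = addr - addr % LINE_BYTES, addr % LINE_BYTES
        return self.lines.get(base, {}).get(offset)

    def drain(self):
        """Write out all entries and return the number of bus writes issued."""
        writes = len(self.lines)
        self.lines.clear()
        return writes


buf = MergingStoreBuffer()
for i in range(8):
    buf.store(0x1000 + i, i)     # eight byte stores to one line
buf.store(0x2000, 0xFF)          # one store to a different line
forwarded = buf.load(0x1003)     # served from the buffer
missing = buf.load(0x3000)       # not buffered, would go to the data cache
bus_writes = buf.drain()         # one write per merged line
print(forwarded, missing, bus_writes)
```

In this model nine stores drain as two bus writes, and a dependent load is satisfied from the buffer without waiting for the data to reach the cache.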
Project Design Improvements
Baseline, ARM1176F
500MHz in TSMC90G, Advantage (9 track), 32K caches
1.2 DMIPS/MHz
0.34 mW/DMIP
200 DMIPS / mm2
65nm and 45nm process further enhancing device capabilities
Capable of solution over 1GHz
Results taken from current trials on pre-release Cortex A9 data
[Chart: improvement factor (axis 0.5 to 2.0) relative to ARM1176 for Performance (Dhrystone), Power Efficiency, and Cost (DMIPS/mm2)]
Summary
Cortex A9 is the 2008 Cortex premium application processor
Provides a migration path for ARM9/ARM11 designs to ARMv7 with increased
processor efficiency and additional performance headroom
Brings together the architectural features of TrustZone and Thumb2 and
multicore processing
Offering significant peak performance through MPCore technology
Capable of over 8000 aggregate DMIPS using 65nm foundry process
While offering increased performance within a mobile 200mW budget
Supporting options for optimized FPU and NEON compute-engines to
increase application specific performance
Utilizes traditional EDA synthesis design tools for rapid
deployment of low-power and scalable performance processor designs