Safe RTL Annotations for Low Power Microprocessor Design

Outline
• Power Dissipation in Hardware Circuits
• Instruction-driven Slicing to attain lower power dissipation
  – Automatically annotates the microprocessor description at the Register Transfer Level and the Architectural level
• Correctness of the introduced annotations
• Case studies

Power Dissipation
$P = \frac{1}{2} \cdot C \cdot V_{DD}^2 \cdot f \cdot N + Q_{SC} \cdot V_{DD} \cdot f \cdot N + I_{leak} \cdot V_{DD}$
• Switching activity power dissipation
  – To charge and discharge nodes
• Short-circuit power dissipation
  – High only for output drivers, clock buffers
• Static power dissipation
  – Due to leakage current

Switching Activity Power Dissipation
• Reduce the squared term $V_{DD}$
  – Leads to an exponential increase in $I_{leak}$
• Host of techniques to reduce switching power at the gate level
  – Clock gating
• Far fewer techniques at the RTL
  – Use program structure and dataflow information available at that level of abstraction

Instruction-driven Slice
• An instruction-driven slice of a microprocessor design is
  – all the relevant circuitry of the design required to completely execute a specific instruction
  – Parts of the decode, execute, writeback etc. blocks
• Cone of influence of the semantics of the instruction

Instruction-driven Slicing
• Given a microprocessor design and an instruction
  – Identify the instruction-driven slice
  – Shut off the rest of the circuitry
• This might include
  – Gating out parts of different blocks
  – Gating out floating point units during integer ALU execution
  – Turning off certain FSMs in different control blocks, since exact constraints on their inputs are available due to instruction-driven slicing

Algorithm (High Level)
• Algorithm instruction-driven-slicing.
Begin
• Inputs: vRTL (Verilog RTL), insts (instructions)
• Output: aRTL (Annotated RTL)
  – Parse vRTL to obtain the Abstract Syntax Program Graph (ASPG)
  – For each instruction I in insts repeat
    • Slice the ASPG for instruction I
    • Traverse the ASPG
    • Add annotation variables if such a block is found
    • If a particular flop is already gated, then add the current annotation in an optimal fashion
    • Return the annotated ASPG
  – Generate Verilog code (aRTL) for the annotated ASPG
End.

Instructions as LTL Properties
• Let $I = i_1 \wedge X\,i_2 \wedge XX\,i_3 \wedge \dots \wedge X^{n-1} i_n$ be an instruction written as an LTL property, such that $i_r$ represents the conditions for the instruction I on clock cycle r.
• $i_1$ represents the instruction word.

RISC Pipeline (OR1200)
• 5-stage RISC pipeline implementation
• Condition for slicing on the ADDC instruction
  – i1: ((icpu_dat_i[31:26] == 6'b111000) ∧ (!rst) ∧ (!flushpipe) ∧ (!if_freeze))
  – i2: (!id_freeze)
  – i3: (!ex_freeze)
  – i4: (!mem_freeze)
  – i5: (!wb_freeze)
• $I = i_1 \wedge X\,i_2 \wedge X^2 i_3 \wedge X^3 i_4 \wedge X^4 i_5$

OR1200 ADDC Instruction
• Introduces five variables:
  – iADDC_if = i1
  – iADDC_id = #1 iADDC_if ∧ i2
  – iADDC_ex = #1 iADDC_id ∧ i3
  – iADDC_mem = #1 iADDC_ex ∧ i4
  – iADDC_wb = #1 iADDC_mem ∧ i5
or1200_ctrl.lsu_op
or1200_ctrl.pre_branch_op

Correct Annotations
• Notion of correctness
  – Original RTL and the annotated RTL should be functionally equivalent under all conditions
• Correctness theorem
    (defthm or1200_slicing_correct
      (equal (or1200_cpu n)
             (or1200_cpu_sliced n)))

ACL2 Theorem Prover
• General-purpose first-order logic theorem prover
• Breaks the theorem down into sub-goals
• Many engines work on the sub-goals and either prove them or break them down further, adding the results to the central pool of goals to be proved
• Success story in hardware
  – Verified FDIV in the AMD processors

Proof Methodology
• The RTL is a shallow embedding in ACL2
• Convert Verilog RTL into ACL2RTL
• We have created a large RTL library to recognize as well as analyze ACL2RTL
• Slicing is done on the
Verilog code
• Both the original and the annotated Verilog are converted into ACL2, and we construct the functional equivalence proof in ACL2

Verilog to ACL2

Methodology
• In order to demonstrate our technique
  – We have incorporated instruction-driven slicing as part of the traditional design flow
  – The vRTL model is annotated to obtain the aRTL model
  – The Synopsys Design Environment has been modified to accept the aRTL, SPEC2000 benchmarks and power process parameters, and to estimate the power dissipation due to switching activity
  – The annotated Architectural model is fed to the SimpleScalar simulator with the Wattch power estimator to estimate the power dissipation

Experiment: OR1200
• We have used our tool-chain to test our methodology on OR1200
  – OR1200 is a pipelined microprocessor implementing the OpenRISC ISA
  – 5-stage integer pipeline with single instruction issue per cycle
  – We have annotated both the RTL and the architectural models of OR1200

[Figure: OR1200: single instruction issue pipelined microprocessor]

OR1200 Power Gain Results
• Results are shown after annotating the
  – RTL (left) and Architectural (right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently

OR1200 Results (contd.)
• Power gains are consistently good (Fig. 1)
• Power gains far outperform area losses (Fig. 1)
• Flop distribution shown before slicing (Fig. 2a), after slicing on add (Fig. 2b) and after slicing on load (Fig.
2c)

Experiment: PUMA
• We have used our tool-chain to test our methodology on PUMA
  – PUMA is a dual-issue, out-of-order superscalar, fixed-point PowerPC core
  – We have annotated both the RTL and the architectural models of PUMA

[Figure: PUMA: a fixed point PowerPC core]

PUMA Power Gain Results
• Results are shown after annotating the
  – RTL (left) and Architectural (right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently

PUMA Results (contd.)
[Fig. 1: PUMA-RTL Power vs. Delay. Series: Power, Delay. X-axis: instruction-driven slicing (Un-sliced, 1-Sliced, 4-Sliced, 10-Sliced). Y-axis: %-age Power gain, Area loss.]
[Fig. 2: PUMA-RTL Power vs. Area. Series: Power, Area. X-axis: instruction-driven slicing (Un-sliced, 1-Sliced, 4-Sliced, 10-Sliced). Y-axis: %-age Power gain, Area loss.]
• Power gains are good upon slicing for a few instructions (~7) before delay losses start dominating (Fig. 1)
• Power gains far outperform area losses (Fig. 2)
• Flop distribution shown before slicing (Fig. 3a), after slicing on add (Fig. 3b) and after slicing on load (Fig. 3c)

Comparing OR1200 and PUMA

Conclusions
• Proposed Instruction-driven Slicing as a new technique to automatically reduce power dissipation
• Implemented the methodology of incorporating instruction-driven slicing into the design flow tool-chain
• Inserting these annotations preserves the functionality of the circuit

Conclusions (continued)
• This technique seems most applicable to single-issue, multi-stage pipelined machines.
• When there are multiple instructions in flight in the same pipeline stage, the gains of a single-instruction abstraction are lost.
• Graphics processors and various embedded applications are often better suited to this technique than general-purpose out-of-order superscalar processors.
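
Sketch: ADDC stage variables
The five annotation variables on the "OR1200 ADDC Instruction" slide can be modelled in software as a chain of one-cycle delays. The sketch below assumes that `#1 x` means the value of `x` on the previous clock cycle; it is an illustration under that assumption, not the tool's implementation. The names `addc_stage_vars` and `cycle`, and the per-cycle dictionary layout, are hypothetical.

```python
ADDC_OPCODE = 0b111000  # icpu_dat_i[31:26] value taken from condition i1


def cycle(word=0, **overrides):
    """Build one clock cycle of input signals (hypothetical helper).

    All control signals default to inactive; pass e.g. id_freeze=True
    to override one of them.
    """
    signals = {"icpu_dat_i": word, "rst": False, "flushpipe": False,
               "if_freeze": False, "id_freeze": False, "ex_freeze": False,
               "mem_freeze": False, "wb_freeze": False}
    signals.update(overrides)
    return signals


def addc_stage_vars(trace):
    """Return iADDC_if .. iADDC_wb for every cycle of `trace`.

    `trace` is a list of dicts as produced by `cycle`. The result is a
    list of dicts keyed by stage name ("if", "id", "ex", "mem", "wb").
    """
    stages = ["if", "id", "ex", "mem", "wb"]
    prev = {s: False for s in stages}  # register contents before cycle 0
    history = []
    for cyc in trace:
        opcode = (cyc["icpu_dat_i"] >> 26) & 0x3F
        i1 = (opcode == ADDC_OPCODE and not cyc["rst"]
              and not cyc["flushpipe"] and not cyc["if_freeze"])
        cur = {
            "if": i1,
            # "#1 x AND i_k": previous-cycle value of the earlier stage
            # variable, qualified by this stage's freeze signal.
            "id": prev["if"] and not cyc["id_freeze"],
            "ex": prev["id"] and not cyc["ex_freeze"],
            "mem": prev["ex"] and not cyc["mem_freeze"],
            "wb": prev["mem"] and not cyc["wb_freeze"],
        }
        history.append(cur)
        prev = cur
    return history
```

Feeding a trace of one ADDC word followed by four idle cycles, `addc_stage_vars([cycle(ADDC_OPCODE << 26)] + [cycle() for _ in range(4)])`, sets exactly one variable per cycle, in the order if, id, ex, mem, wb. The model follows the slide's definitions literally, so a freeze clears the stage variable instead of holding it.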
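
Sketch: evaluating the power equation
The equation on the "Power Dissipation" slide can be evaluated term by term to see which terms a reduction in switching activity affects. This is a minimal sketch that assumes the conventional reading of the symbols, which the slides do not define: C as switched capacitance, f as clock frequency and N as switching activity. The function name `power_terms` and any input values are hypothetical.

```python
def power_terms(c, vdd, f, n, q_sc, i_leak):
    """Return (switching, short_circuit, static) power.

    Each entry mirrors one term of
    P = 1/2*C*VDD^2*f*N + Q_SC*VDD*f*N + I_leak*VDD.
    """
    switching = 0.5 * c * vdd ** 2 * f * n
    short_circuit = q_sc * vdd * f * n
    static = i_leak * vdd
    return switching, short_circuit, static


def total_power(c, vdd, f, n, q_sc, i_leak):
    """Sum of the three terms returned by power_terms."""
    return sum(power_terms(c, vdd, f, n, q_sc, i_leak))
```

Under these assumptions, halving `n` halves the first two terms and leaves the static term unchanged, which is why gating out idle circuitry targets the switching term.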