Safe RTL Annotations for Low Power Microprocessor Design

                   Outline
• Power Dissipation in Hardware Circuits
• Instruction-driven Slicing to attain lower
  power dissipation
  – Automatically annotates microprocessor
    description at the Register Transfer Level and
    Architectural level
• Correctness of the introduced annotations
• Case studies
             Power Dissipation
$$P = \frac{1}{2} \cdot C \cdot V_{DD}^2 \cdot f \cdot N + Q_{SC} \cdot V_{DD} \cdot f \cdot N + I_{leak} \cdot V_{DD}$$
• Switching activity power dissipation
   – To charge and discharge nodes
• Short Circuit power dissipation
   – High only for output drivers, clock buffers
• Static power dissipation
   – Due to leakage current
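The three terms can be evaluated separately to see which knob affects which component. The following is a minimal sketch, assuming N is the switching activity factor (the slide does not define it); the function names and all numeric values are hypothetical and chosen only for illustration.

```python
def switching_power(c, vdd, f, n):
    """Switching term: 1/2 * C * VDD^2 * f * N."""
    return 0.5 * c * vdd ** 2 * f * n


def short_circuit_power(q_sc, vdd, f, n):
    """Short-circuit term: Q_SC * VDD * f * N."""
    return q_sc * vdd * f * n


def static_power(i_leak, vdd):
    """Static term: I_leak * VDD."""
    return i_leak * vdd


def total_power(c, vdd, f, n, q_sc, i_leak):
    """Sum of the three terms of the power equation."""
    return (switching_power(c, vdd, f, n)
            + short_circuit_power(q_sc, vdd, f, n)
            + static_power(i_leak, vdd))


# Illustrative (made-up) values: halving the activity N halves the
# switching term, while halving VDD cuts it to a quarter.
base = switching_power(c=2.0, vdd=1.0, f=4.0, n=0.5)
half_activity = switching_power(c=2.0, vdd=1.0, f=4.0, n=0.25)
half_vdd = switching_power(c=2.0, vdd=0.5, f=4.0, n=0.5)
```

Lowering N is the lever that gating unused circuitry pulls: the gated nodes stop toggling, so their contribution to the switching term drops without touching VDD.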
Switching Activity Power Dissipation
 • Reduce V_DD, which enters the switching term squared
   – Keeping speed then requires a lower threshold
     voltage, which increases I_leak exponentially
 • A host of techniques reduce switching
   power at the gate level
   – Clock gating
 • Far fewer techniques exist at the RTL
   – Use program structure and dataflow
     information available at that level of
     abstraction
      Instruction-driven Slice
• An instruction-driven slice of a
  microprocessor design is
  – all the relevant circuitry of the design
    required to completely execute a specific
    instruction
  – Parts of the decode, execute, writeback
    etc. blocks
• Cone of influence of the semantics of
  the instruction
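A cone of influence can be computed as backward reachability over the fan-in relation of the design. The sketch below assumes a toy netlist given as a dictionary from each signal to the signals it depends on; the signal names and the choice of root signal for an add-like instruction are hypothetical, not taken from any real design.

```python
from collections import deque


def cone_of_influence(fanin, roots):
    """Return every signal that can affect any of the root signals."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        sig = queue.popleft()
        for src in fanin.get(sig, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


# Hypothetical toy datapath: each key depends on the listed signals.
FANIN = {
    "alu_result": ["operand_a", "operand_b", "alu_op"],
    "lsu_result": ["mem_data", "lsu_op"],
    "fpu_result": ["operand_a", "operand_b", "fpu_op"],
    "operand_a": ["rf_read_a"],
    "operand_b": ["rf_read_b"],
}

# Assumption: the semantics of an add-like instruction only mention alu_result.
add_slice = cone_of_influence(FANIN, ["alu_result"])
```

In this toy example the load/store and floating-point signals fall outside `add_slice`, which makes them candidates for gating while the add executes.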
     Instruction-driven Slicing
• Given a microprocessor design and an
  instruction
  – Identify the instruction-driven slice
  – Shut off the rest of the circuitry
• This might include
  – Gating out parts of different blocks
  – Gating out floating point units during
    integer ALU execution
  – Turning off certain FSMs in different
    control blocks since exact constraints on
    their inputs are available due to
    instruction-driven slicing
         Algorithm (High Level)
• Algorithm instruction-driven-slicing.
  Begin
     • Inputs: vRTL (Verilog RTL), insts (instructions)
     • Output: aRTL (Annotated RTL)
  – Parse vRTL to obtain the Abstract Syntax Program
    Graph (ASPG)
  – For each instruction I in insts repeat
     • Slice the ASPG for instruction I
     • Traverse the ASPG to find blocks outside the slice
     • Add annotation variables if such a block is found
     • If a particular flop is already gated, then
        add the current annotation in an optimal fashion
     • Return the annotated ASPG
  – Generate Verilog code (aRTL) for the annotated ASPG
  End.
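The per-instruction loop above can be sketched on a simplified model. This is not the authors' tool: it skips Verilog parsing and the ASPG, takes the slices as given sets of flop names, and merges annotations for an already-gated flop by a plain OR, since the slides do not say what "optimal" merging means. All flop, instruction, and variable names are hypothetical.

```python
def annotate(flops, slices):
    """Map each flop to the annotation variables that gate it off.

    flops:  list of flop names in the design.
    slices: dict mapping an instruction name to the set of flops in its slice.
    A flop outside the slice of instruction I is gated while I executes.
    """
    gating = {flop: [] for flop in flops}
    for instr, slice_flops in slices.items():
        var = f"i{instr}_active"  # hypothetical annotation variable name
        for flop in flops:
            if flop not in slice_flops:
                # An already-gated flop simply collects one more condition.
                gating[flop].append(var)
    return gating


def emit_enable(flop, gate_vars):
    """Render a Verilog-style enable for one flop (illustrative only)."""
    if not gate_vars:
        return f"wire {flop}_en = 1'b1;"
    return f"wire {flop}_en = ~({' | '.join(gate_vars)});"


gating = annotate(
    ["alu_reg", "lsu_reg", "fpu_reg"],
    {"ADDC": {"alu_reg"}, "LW": {"lsu_reg"}},
)
fpu_enable = emit_enable("fpu_reg", gating["fpu_reg"])
```

Here `fpu_reg` lies outside both slices, so it ends up gated by both annotation variables, while `alu_reg` is gated only during the load.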
 Instructions as LTL Properties
• Let $I = i_1 \wedge X\, i_2 \wedge XX\, i_3 \wedge \dots \wedge X^{n-1} i_n$ be an
  instruction written as an LTL property,
  such that i_r represents the conditions
  for the instruction I on clock cycle r.

• i_1 represents the instruction word.
     RISC Pipeline (OR1200)
• 5 stage RISC pipeline implementation
• Condition for slicing on ADDC instruction
  – i_1: ((icpu_dat_i[31:26] == 6'b111000) ∧
          (!rst) ∧ (!flushpipe) ∧ (!if_freeze))
  – i_2: (!id_freeze)
  – i_3: (!ex_freeze)
  – i_4: (!mem_freeze)
  – i_5: (!wb_freeze)

• $I = i_1 \wedge X\, i_2 \wedge X^2 i_3 \wedge X^3 i_4 \wedge X^4 i_5$
      OR1200 ADDC Instruction
• Introduces five variables:
  –   iADDC_if = i_1
  –   iADDC_id = #1 iADDC_if ∧ i_2
  –   iADDC_ex = #1 iADDC_id ∧ i_3
  –   iADDC_mem = #1 iADDC_ex ∧ i_4
  –   iADDC_wb = #1 iADDC_mem ∧ i_5
or1200_ctrl.lsu_op
or1200_ctrl.pre_branch_op
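The five variables form a chain in which each stage variable is the previous stage's value from one cycle earlier, ANDed with the current stage condition. The sketch below simulates that chain cycle by cycle, assuming `#1` denotes a one-cycle delay and that all five variables reset to false; the function name and the trace encoding are hypothetical.

```python
def track_instruction(trace):
    """Simulate the iADDC_if .. iADDC_wb chain over a trace.

    trace: list of 5-tuples of booleans, one tuple per clock cycle, giving
           the stage conditions (i_1, i_2, i_3, i_4, i_5) in that cycle.
    Returns a list of 5-tuples (if, id, ex, mem, wb), one per cycle.
    """
    prev = (False,) * 5  # assumed reset value of the five variables
    history = []
    for conds in trace:
        cur = [conds[0]]  # iADDC_if = i_1
        for stage in range(1, 5):
            # stage variable = (previous stage, one cycle ago) AND condition
            cur.append(prev[stage - 1] and conds[stage])
        prev = tuple(cur)
        history.append(prev)
    return history


# ADDC is fetched in cycle 0 and no stage is frozen afterwards.
no_freeze = [(True, True, True, True, True)] + [(False, True, True, True, True)] * 4
flow = track_instruction(no_freeze)

# Same fetch, but the execute stage is frozen in cycle 2 (i_3 is false).
frozen = list(no_freeze)
frozen[2] = (False, True, False, True, True)
stalled = track_instruction(frozen)
```

In `flow` exactly one variable is true per cycle and it moves one stage to the right each cycle, so `iADDC_wb` holds in cycle 4 exactly when the property I holds from cycle 0. In `stalled` the chain is broken at the execute stage and `iADDC_wb` never becomes true.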
        Correct Annotations
• Notion of correctness
  – Original RTL and the annotated RTL should
    be functionally equivalent under all
    conditions
• Correctness theorem
  (defthm or1200_slicing_correct
     (equal (or1200_cpu n)
            (or1200_cpu_sliced n)))
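The shape of this equivalence claim can be illustrated on a toy model by bounded exhaustive simulation. This is only a sanity check on a made-up two-unit datapath, not a substitute for the ACL2 proof, which covers all n; the model, its operations, and the function names are assumptions for illustration.

```python
from itertools import product


def run(instrs, gated):
    """Run a toy two-unit datapath and return its observable outputs.

    instrs: sequence of (op, a, b) with op in {"add", "mul"}.
    gated:  if True, each unit's register only updates when its own
            instruction executes (the sliced design); otherwise both
            registers update every cycle (the original design).
    """
    alu_reg = mul_reg = 0
    outputs = []
    for op, a, b in instrs:
        if op == "add" or not gated:
            alu_reg = a + b
        if op == "mul" or not gated:
            mul_reg = a * b
        outputs.append(alu_reg if op == "add" else mul_reg)
    return outputs


def equivalent_up_to(n):
    """Compare original and gated models on every input sequence of length n."""
    stimuli = list(product(("add", "mul"), range(2), range(2)))
    return all(run(seq, gated=False) == run(seq, gated=True)
               for seq in product(stimuli, repeat=n))


ok = equivalent_up_to(3)
```

The internal registers of the two models differ, because the gated one holds stale values in the unused unit, but the observable outputs agree. That is the kind of functional equivalence the theorem states.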
      ACL2 Theorem Prover
• General-purpose first-order logic
  theorem prover
• Breaks down the theorem into sub-goals
• Many engines work on the sub-goals and
  either prove them or break them
  down further, adding the pieces to the
  central pool of goals to be proved
• Success story in hardware
  – Verified FDIV in AMD processors
        Proof Methodology
• The RTL is a shallow embedding in ACL2
• Convert Verilog RTL into ACL2RTL
• We have created a large RTL library to
  recognize as well as analyze ACL2RTL
• Slicing is done on the Verilog code
• Both original and annotated Verilog are
  converted into ACL2 and we construct
  the functional equivalence proof in ACL2
Verilog to ACL2
               Methodology
• In order to demonstrate our technique
  – We have incorporated instruction-driven slicing as
    part of the traditional design flow
  – The vRTL model is annotated to obtain the aRTL
    model
  – The Synopsys design environment has been modified
    to accept the aRTL, SPEC2000 benchmarks, and
    power process parameters, and to estimate the
    power dissipation due to switching activity
  – The annotated Architectural model is fed to the
    SimpleScalar simulator with the Wattch power
    estimator to estimate the power dissipation
Methodology
       Experiment: OR1200
• We have used our tool-chain to test our
  methodology on OR1200
  – OR1200 is a pipelined microprocessor
    implementing the OpenRISC ISA.
  – 5-stage integer pipeline with single
    instruction issue per cycle
  – We have annotated both the RTL and the
    architectural models of OR1200
OR1200: single instruction issue
  pipelined microprocessor
   OR1200 Power Gain Results




• Results are shown after annotating the
  – RTL (left) and Architectural (right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently
          OR1200 Results (contd.)

[Figures: Fig. 1, Fig. 2a, Fig. 2b, Fig. 2c]

• Power gains are consistently good (Fig. 1)
• Power gains far outperform area losses
  (Fig. 1)
• Flop distribution shown before slicing
  (Fig. 2a), after slicing on add (Fig. 2b), and
  after slicing on load (Fig. 2c)
         Experiment: PUMA
• We have used our tool-chain to test our
  methodology on PUMA
  – PUMA is a dual-issue, out-of-order super-
    scalar, fixed-point PowerPC core
  – We have annotated both the RTL and the
    architectural models of PUMA
PUMA: a fixed point PowerPC core
     PUMA Power Gain Results




• Results are shown after annotating the
  – RTL (left) and Architectural (right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently
         PUMA Results (contd.)

[Fig. 1: PUMA-RTL Power vs. Delay. Series: Power, Delay. Y-axis: %-age Power gain, Area loss. X-axis: Instruction-driven slicing (Unsliced, 1-Sliced, 4-Sliced, 10-Sliced)]
[Fig. 2: PUMA-RTL Power vs. Area. Series: Power, Area. Y-axis: %-age Power gain, Area loss. X-axis: Instruction-driven slicing (Unsliced, 1-Sliced, 4-Sliced, 10-Sliced)]
[Figures: Fig. 3a, Fig. 3b, Fig. 3c]

• Power gains are good upon slicing for a few
  instructions (~7) before delay losses start
  dominating (Fig. 1)
• Power gains far outperform area losses (Fig. 2)
• Flop distribution shown before slicing (Fig. 3a),
  after slicing on add (Fig. 3b), and after slicing
  on load (Fig. 3c)
Comparing OR1200 and PUMA
             Conclusions
• Proposed Instruction-driven Slicing as a
  new technique to automatically reduce
  power dissipation
• Implemented the methodology of
  incorporating instruction-driven slicing
  into the design flow tool-chain
• Inserting these annotations preserves
  the functionality of the circuit
      Conclusions (continued)
• This technique seems most applicable to
  single-issue, multi-stage pipelined machines.
• When multiple instructions are in flight
  in the same pipeline stage, the gains of a
  single-instruction abstraction are lost.
• Graphics processors and various embedded
  applications are often better suited to
  this technique than general-purpose
  out-of-order superscalars.

				