; Safe RTL Annotations for Low Power Microprocessor Design
Documents
Resources
Learning Center
Upload
Plans & pricing Sign in
Sign Out
Your Federal Quarterly Tax Payments are due April 15th Get Help Now >>

Safe RTL Annotations for Low Power Microprocessor Design

VIEWS: 9 PAGES: 29

  • pg 1
									Safe RTL Annotations for Low
Power Microprocessor Design
                   Outline
• Power Dissipation in Hardware Circuits
• Instruction-driven Slicing to attain lower
  power dissipation
  – Automatically annotates microprocessor
    description at the Register Transfer Level and
    Architectural level
• Correctness of the introduced annotations
• Case studies
             Power Dissipation
P = 1/2 ¢ C ¢ V2DD ¢ f ¢ N   + QSC ¢ VDD ¢ f ¢ N + Ileak ¢
                             VDD
• Switching activity power dissipation
   – To charge and discharge nodes
• Short Circuit power dissipation
   – High only for output drivers, clock buffers
• Static power dissipation
   – Due to leakage current
Switching Activity Power Dissipation
 • Reduce the squared term VDD
   – Leads to exponential increase in Ileak
 • Host of techniques to reduce switching
   power at the gate level
   – Clock gating
 • Relatively much lesser at the RTL
   – Use program structure and dataflow
     information available at that level of
     abstraction
      Instruction-driven Slice
• An instruction-driven slice of a
  microprocessor design is
  – all the relevant circuitry of the design
    required to completely execute a specific
    instruction
  – Parts of the decode, execute, writeback
    etc. blocks
• Cone of influence of the semantics of
  the instruction
     Instruction-driven Slicing
• Given a microprocessor design and an
  instruction
  – Identify the instruction-driven slice
  – Shut off the rest of the circuitry
• This might include
  – Gating out parts of different blocks
  – Gating out floating point units during
    integer ALU execution
  – Turning off certain FSMs in different
    control blocks since exact constraints on
    their inputs are available due to
    instruction-driven slicing
         Algorithm (High Level)
• Algorithm instruction-driven-slicing.
  Begin
     • Inputs: vRTL (Verilog RTL), insts (instructions)
     • Output: aRTL (Annotated RTL)
  – Parse vRTL to obtain the Abstract Syntax Program
    Graph (ASPG)
  – For each instruction I in insts repeat
     • Slice the ASPG for instruction I
     • Traverse the ASPG
     • Add annotation variables if such a block is found
     • If a particular flop is already gated, then
        add the current annotation in an optimal fashion
     • Return the annotated ASPG
  – Generate Verilog code (aRTL) for the annotated ASPG
  End.
 Instructions as LTL Properties
• Let I = i1 Æ X i2 Æ XX i3 ... Xn-1 in be an
  instruction written as an LTL property,
  such that ir represents the conditions
  for the instruction I on clock cycle r.

• i1 represents the instruction word.
     RISC Pipeline (OR1200)
• 5 stage RISC pipeline implementation
• Condition for slicing on ADDC instruction
  – i1: ((icpu_dat_i[31:26]==6’b 111000) Æ
         (!rst) Æ (!flushpipe) Æ (!if_freeze))
  – i2: (!id_freeze)
  – i3: (!ex_freeze)
  – i4: (!mem_freeze)
  – i5: (!wb_freeze)

• I = i1 Æ X i2 Æ X2i3 Æ X3i4 Æ X4i5
      OR1200 ADDC Instruction
• Introduces five variables:
  –   iADDC_if = i1
  –   iADDC_id = #1 iADDC_if Æ i2
  –   iADDC_ex = #1 iADDC_id Æ i3
  –   iADDC_mem = #1 iADDC_ex Æ i4
  –   iADDC_wb = #1 iADDC_mem Æ i5
or1200_ctrl.lsu_op
or1200_ctrl.pre_branch_op
        Correct Annotations
• Notion of correctness
  – Original RTL and the annotated RTL should
    be functionally equivalent under all
    conditions
• Correctness theorem
  (defthm or1200_slicing_correct
     (equal (or1200_cpu n)
            (or1200_cpu_sliced n)))
      ACL2 Theorem Prover
• First order logic general purpose
  theorem prover
• Breakdown the theorem into sub-goals
• Many engines work on the sub-goals and
  will either prove them or break them
  down further and add to the central
  pool of goals to be proved
• Success story in Hardware
  – Verified FDIV in the AMD processors
        Proof Methodology
• The RTL is a shallow embedding in ACL2
• Convert Verilog RTL into ACL2RTL
• We have created a large RTL library to
  recognize as well as analyze ACL2RTL
• Slicing is done on the Verilog code
• Both original and annotated Verilog are
  converted into ACL2 and we construct
  the functional equivalence proof in ACL2
Verilog to ACL2
               Methodology
• In order to demonstrate our technique
  – We have incorporated instruction-driven slicing as
    part of the traditional design flow
  – The vRTL model is annotated to obtain the aRTL
    model
  – Synopsys Design Environment has been sufficiently
    modified to accept the aRTL, SPEC2000
    benchmarks and power process parameters and
    estimate the power dissipation due to switching
    activity
  – The annotated Architectural model is fed to the
    SimpleScalar simulator with the Wattch power
    estimator to estimate the power dissipation
Methodology
       Experiment: OR1200
• We have used our tool-chain to test our
  methodology on OR1200
  – OR1200 is a pipelined microprocessor
    implementing the OpenRISC ISA.
  – 5-stage integer pipeline with single
    instruction issue per cycle
  – We have annotated both the RTL and the
    architectural models of OR1200
OR1200: single instruction issue
  pipelined microprocessor
   OR1200 Power Gain Results




• Results are shown after annotating the
  – RTL (left) and Architectural (Right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently
          OR1200 Results (contd.)

                                                   Fig.2a




                                                   Fig.2b


                     Fig. 1

•   Power gains are consistently good (Fig. 1)
•   Power gains far outperform area losses
    (Fig 1)                                        Fig.2c
•   Flop distribution shown before slicing
    (Fig. 2a) after slicing on add (Fig. 2b) and
    after slicing on load (Fig. 2c)
         Experiment: PUMA
• We have used our tool-chain to test our
  methodology on PUMA
  – PUMA is a dual-issue, out-of-order super-
    scalar, fixed-point PowerPC core
  – We have annotated both the RTL and the
    architectural models of PUMA
PUMA: a fixed point PowerPC core
     PUMA Power Gain Results




• Results are shown after annotating the
  – RTL (left) and Architectural (Right) models
  – For un-sliced and sliced on 1, 4, 10 instructions
  – For SPECINT2000 benchmarks
• Power dissipation decreases consistently
                                                                                           PUMA Results (contd.)
                                                                                           PUMA-RTL Power vs. Delay

                                       1.2


                                                            1
         %-age Power gain, Area loss




                                       0.8




Fig. 1
                                       0.6
                                                                                                                                                                Power
                                                                                                                                                                Delay
                                                                                                                                                                        Fig.3a
                                       0.4


                                       0.2


                                                            0
                                                                                                d




                                                                                                                               d




                                                                                                                                                    d
                                                                                 d




                                                                                              ce




                                                                                                                             ce




                                                                                                                                               lic e
                                                                                   e
                                                                               slic




                                                                                               li




                                                                                                                            li
                                                                                           1- S




                                                                                                                        4- S




                                                                                                                                              -S
                                                                            Un




                                                                                                                                           10
                                                                                                    Instruction-driven slicing




                                                                                           PUMA-RTL Power vs. Area

                                                                     1.15


                                                                      1.1
                                       %-age Power gain, Area loss




Fig. 2                                                               1.05
                                                                                                                                                                        Fig.3b
                                                                                                                                                        Power
                                                                       1
                                                                                                                                                        Area

                                                                     0.95


                                                                      0.9


                                                                     0.85
                                                                                                d




                                                                                                                         d




                                                                                                                                           d
                                                                                       d




                                                                                              ce




                                                                                                                       ce




                                                                                                                                      lic e
                                                                                       e
                                                                                   slic




                                                                                               li




                                                                                                                        li
                                                                                           1- S




                                                                                                                    4- S




                                                                                                                                      -S
                                                                                Un




                                                                                                                                   10




                                                                                                Instruction-driven slicing




   •     Power gains are good upon slicing for a few
         instructions (~7) before delay losses start
         dominating (Fig. 1)
   •     Power gains far outperform area losses (Fig 2)                                                                                                                 Fig.3c
   •     Flop distribution shown before slicing (Fig. 3a)
         after slicing on add (Fig. 3b) and after slicing
         on load (Fig. 3c)
Comparing OR1200 and PUMA
             Conclusions
• Proposed Instruction-driven Slicing as a
  new technique to automatically reduce
  power dissipation
• Implemented the methodology of
  incorporating instruction-driven slicing
  into the design flow tool-chain
• Inserting these annotations preserves
  the functionality of the circuit
      Conclusions (continued)
• This technique seems most applicable to
  single-issue multi-staged pipelined machines.
• When there are multiple instructions in-flight
  in the same pipeline stage, the gains of a
  single-instruction-abstraction are lost.
• Graphics processors, various embedded
  applications are more often better suited for
  this technique than general purpose out-of-
  order superscalars.

								
To top