Power Savings in Embedded Processors through Decode Filter Cache




    Weiyu Tang, Rajesh Gupta, Alex Nicolau
Overview
•    Introduction
•    Related Work
•    Decode Filter Cache
•    Results and Conclusion




Weiyu Tang, Rajesh Gupta, Alex Nicolau
CECS, University of California, Irvine
Introduction
•    Instruction delivery is a major power consumer in
     embedded systems
       • Instruction fetch
              – 27% of processor power in StrongARM
       • Instruction decode
              – 18% of processor power in StrongARM
•    Goal
      • Reduce power in instruction delivery with minimal
        performance penalty
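
As a rough worked bound, taking the two StrongARM shares quoted above at face value and assuming that fetch and decode are the only components a delivery optimization touches, the instruction-delivery share caps the achievable processor power saving:

```latex
P_{\text{delivery}} = P_{\text{fetch}} + P_{\text{decode}} = 27\% + 18\% = 45\%
```

The symbols are introduced here for illustration only. Whether the same shares hold for other processors is not stated in the slides.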




Related Work
• Architectural approaches to reduce
  instruction fetch power
     • Store instructions in small, power-efficient storage structures
     • Examples:
          – Line buffers
          – Loop cache
          – Filter cache




Related Work
• Architectural approaches to reduce
  instruction decode power
     • Avoid unnecessary decoding by saving decoded
       instructions in a separate cache
     • Trace cache
          – Store decoded instructions in execution order
          – Fixed cache access order
              • Instruction cache is accessed on trace cache misses
          – Targeted for high-performance processors
              • Increase fetch bandwidth
              • Require sophisticated branch prediction mechanisms
          – Drawbacks
              • Not power efficient as the cache size is large



Related Work
   • Micro-op cache
        – Store decoded instructions in program order
        – Fixed cache access order
            • Instruction cache and micro-op cache are accessed in
              parallel to minimize micro-op cache miss penalty
        – Drawbacks
            • Needs an extra pipeline stage, which increases the
              misprediction penalty
            • Requires a branch predictor
            • Per-access power is large
                   – Micro-op cache size is large
                   – Power is consumed by both the micro-op cache and
                     the instruction cache




Decode Filter Cache
•    Targeted processors
       •    Single-issue, in-order execution
•    Research goal
       •    Use a small (and power efficient) cache to save decoded
            instructions
       •    Reduce instruction fetch power and decode power
            simultaneously
       •    Reduce power without sacrificing performance
•    Problems to deal with
       •    What kind of cache organization to use
       •    Where to fetch instructions from, since they can be provided
            by multiple sources
       •    How to minimize decode filter cache miss latency




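Decode Filter Cache
[Figure: pipeline with stages fetch, decode, execute, mem, and writeback (paths labeled 1 to 5). The fetch address feeds a predictor; instructions are supplied by the line buffer, the I-cache, or the decode filter cache.]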
Decode Filter Cache
• Decode filter cache organization
    • Problems with traditional cache organization
         – The decoded instruction width varies
         – Saving all decoded instructions would waste cache space
    • Our approach
         – Instruction classification
             • Classify instructions into cacheable and uncacheable
                depending on instruction width distribution
             • Use a “cacheable ratio” to balance the cache utilization vs.
                the number of instructions that can be cached
         – Sectored cache organization
             • Each instruction can be cached independently of
                neighboring lines
             • Neighboring lines share a tag to reduce cache tag store
                cost
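
The two ideas above can be sketched together in a few lines of Python. This is an illustrative model under stated assumptions, not the authors' design: `width_threshold`, `SectoredDecodeCache`, the 4-byte instruction size, and the choice of four instruction slots per shared tag are all hypothetical, and the sketch reads "neighboring lines share a tag" as "the slots of one tagged group share a tag".

```python
from collections import Counter


def width_threshold(widths, cacheable_ratio):
    """Return the smallest decoded width (in bits) such that at least
    `cacheable_ratio` of the sampled instructions fit within it."""
    counts = Counter(widths)
    total = len(widths)
    covered = 0
    for width in sorted(counts):
        covered += counts[width]
        if covered / total >= cacheable_ratio:
            return width
    return max(counts)


class SectoredDecodeCache:
    """Direct-mapped cache whose slots (sectors) fill independently
    while sharing one tag per group. All sizes are assumptions."""

    def __init__(self, num_lines=16, sectors_per_line=4, max_width=56):
        self.num_lines = num_lines
        self.sectors_per_line = sectors_per_line
        self.max_width = max_width  # widths above this are uncacheable
        self.tags = [None] * num_lines
        self.sectors = [[None] * sectors_per_line for _ in range(num_lines)]

    def _locate(self, pc):
        word = pc // 4  # assumes fixed 4-byte instructions
        sector = word % self.sectors_per_line
        line_no = word // self.sectors_per_line
        return line_no % self.num_lines, line_no // self.num_lines, sector

    def lookup(self, pc):
        """Return the cached decoded instruction, or None on a miss."""
        index, tag, sector = self._locate(pc)
        if self.tags[index] != tag:
            return None
        return self.sectors[index][sector]

    def fill(self, pc, decoded, width):
        """Cache one decoded instruction; return False if it is uncacheable."""
        if width > self.max_width:
            return False
        index, tag, sector = self._locate(pc)
        if self.tags[index] != tag:
            # A new tag invalidates every sector that shared the old one.
            self.tags[index] = tag
            self.sectors[index] = [None] * self.sectors_per_line
        self.sectors[index][sector] = decoded
        return True
```

A lower cacheable ratio yields a narrower slot, so more slots fit in the same storage, at the cost of leaving more instructions uncacheable; a missing sector under a matching tag is simply a miss for that one instruction.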


Decode Filter Cache
• Where to fetch instructions
    • Instructions can be provided from one of the following
      sources
         – Line buffer
         – Decode filter cache
         – Instruction cache
    • Predictive order for instruction fetch
         – For power efficiency, either the decode filter cache or the line
           buffer is accessed first when an instruction is likely to hit
         – To minimize decode filter cache miss penalty, the instruction
           cache is accessed directly when the decode filter cache is
           likely to miss




Decode Filter Cache
•   Prediction mechanism
     • When the next fetch address and the current address map to
       the same cache line
           – If the current fetch source is the line buffer, the next fetch
             source remains the same
           – If the current fetch source is the decode filter cache and the
             corresponding instruction is valid, the next fetch source
             remains the same
           – Otherwise, the next fetch source is the instruction cache
     • When the next fetch address and the current address map to
       different cache lines
           – Predict based on the next fetch prediction table, which exploits
             control flow predictability
           – If the tag of the current fetch address and the tag of the predicted
             next fetch address are the same, the next fetch source is the
             decode filter cache
           – Otherwise, the next fetch source is the instruction cache
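
The prediction rules on this slide can be written as a small decision function. The sketch below is one possible reading, not the authors' hardware: the names `Source`, `predict_next_source`, `dfc_entry_valid`, and `next_fetch_table` are hypothetical, the 16-byte line and 32-line geometry are assumptions, and passing the actual next address to pick the case is a simplification of whatever signal the real pipeline uses.

```python
from enum import Enum


class Source(Enum):
    LINE_BUFFER = "line buffer"
    DECODE_FILTER_CACHE = "decode filter cache"
    INSTRUCTION_CACHE = "instruction cache"


LINE_BYTES = 16  # assumed line size
NUM_LINES = 32   # assumed number of lines covered by one tag value


def line_of(addr):
    """Cache-line number of an address."""
    return addr // LINE_BYTES


def tag_of(addr):
    """Tag of an address under the assumed geometry."""
    return addr // (LINE_BYTES * NUM_LINES)


def predict_next_source(cur_addr, next_addr, cur_source,
                        dfc_entry_valid, next_fetch_table):
    """Choose the structure to access for the next fetch.

    next_fetch_table maps the current line number to a predicted next
    fetch address (a stand-in for the next fetch prediction table).
    """
    if line_of(next_addr) == line_of(cur_addr):
        # Same line: stay with a source that is known to be working.
        if cur_source is Source.LINE_BUFFER:
            return Source.LINE_BUFFER
        if cur_source is Source.DECODE_FILTER_CACHE and dfc_entry_valid:
            return Source.DECODE_FILTER_CACHE
        return Source.INSTRUCTION_CACHE
    # Different line: consult the prediction table and compare tags.
    predicted = next_fetch_table.get(line_of(cur_addr))
    if predicted is not None and tag_of(predicted) == tag_of(cur_addr):
        return Source.DECODE_FILTER_CACHE
    return Source.INSTRUCTION_CACHE
```

Falling back to the instruction cache whenever a hit looks unlikely is what keeps the miss penalty low: a wrong guess costs power, while a skipped access costs nothing in cycles.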
Results
• Simulation setup
     • Media Benchmark
     • Cache size
          – 512B decode filter cache, 16KB instruction cache, 8KB data cache.
     • Configurations investigated


         Config   Line     Decode filter   Cacheable   Instruction    Use
                  buffer   cache           ratio       filter cache   predictor
         DF_0.9   X        X               0.9         -              X
         DF_0.8   X        X               0.8         -              X
         DF_0.7   X        X               0.7         -              X
         DF_0.6   X        X               0.6         -              X
         DF_NO    X        X               0.9         -              -
         IF       -        -               -           X              -



Results: % reduction in I-cache fetches
[Figure: bar chart of the percentage reduction in I-cache fetches (y-axis 0 to 100) for each benchmark and the average (avg), comparing IF, DF_NO, and DF_0.9.]
Results: % reduction in instruction decodes
[Figure: bar chart of the percentage reduction in instruction decodes (y-axis 0 to 100) for each benchmark and the average (avg), comparing DF_NO and DF_0.9.]
Results: normalized delay
[Figure: bar chart of normalized delay (y-axis 0.90 to 1.15) for each benchmark and the average (avg), comparing IF, DF_NO, and DF_0.9.]
Results: % reduction in processor power
[Figure: bar chart of the percentage reduction in processor power (y-axis 0 to 45) for each benchmark and the average (avg), comparing IF, DF_0.6, DF_0.7, DF_0.8, and DF_0.9.]
Conclusion
• There is a basic tradeoff between
   • the number of instructions cached (as in instruction caches), and
   • greater power savings from reducing both decode and fetch work
     (as in decode caches).
• We tip this balance in favor of the decode cache through the
  coordinated operation of
   • instruction classification/selective decoding (into smaller
     widths)
   • sectored caches built around this classification
• The results show
   • Average 34% reduction in processor power
       – 50% more effective in power savings than an instruction filter
         cache
   • Less than 1% performance degradation due to an effective
       prediction mechanism