Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
 International Symposium on Computer Architecture
                    June 7th, 2005
               Overview of Idea
Coarse-Grain Coherence Tracking:

1. Monitors coherence status of memory at a
   multi-line granularity

2. Uses the coarse-grain information to identify
   requests that don’t need a coherence
   broadcast

3. Sends these requests directly to memory
June 7, 2005         ISCA 2005                 2
                    Problem
[Figure: system diagram. Processors (P), a cache ($), NC, MC, and DRAM modules connected by a Broadcast Network and a Data Network.]

Snoop-based systems support a limited
number of processors
    – Limited broadcast bandwidth
    – Increasing memory latency
                 Opportunity
• Some data requests don’t need a broadcast
    – Requests for non-shared data
    – Fetches of unmodified instructions
    – Write-backs

• Some non-data requests don’t need to leave the
  processor
    – Requests to upgrade copy, but not shared
    – Requests to flush copies, but not cached elsewhere
                  Unnecessary Broadcasts
[Figure: stacked bar chart of the percentage of requests (0% to 100%) that are unnecessary broadcasts, broken down by request type: Read, I-Fetch, Write, DCB, Write-back. Bar labels: Scientific 62%, Multiprogrammed 93%, Commercial 65%, Arithmetic Mean 67%.]
               Our Approach
• Identify requests that don’t need a broadcast

• Send data requests directly to memory
    – Reduce broadcast traffic
    – Reduce latency in some systems

• Avoid sending non-data requests externally
    – Further reduce broadcast traffic
    – Reduce latency

Coarse-Grain Coherence Tracking

• Memory is divided into coarse-grain regions
    – Aligned, power-of-two multiple of cache line size
    – Can range from two lines to a physical page


• A cache-like structure is added to each
  processor for monitoring coherence at the
  granularity of regions
    – Region Coherence Array (RCA)
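
A minimal sketch of the region arithmetic described above, assuming 64-byte cache lines (the slides do not state the line size). The names `region_base` and `same_region` are hypothetical helpers for illustration only.

```python
LINE_SIZE = 64  # bytes; an assumption, not a value from the slides


def _is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


def region_base(addr: int, region_size: int, line_size: int = LINE_SIZE) -> int:
    """Return the base address of the aligned region that contains addr."""
    # A region is an aligned, power-of-two multiple of the line size,
    # and covers at least two lines.
    if not (_is_power_of_two(line_size) and _is_power_of_two(region_size)):
        raise ValueError("line and region sizes must be powers of two")
    if region_size < 2 * line_size:
        raise ValueError("a region must cover at least two cache lines")
    # Because regions are aligned, clearing the low bits yields the base.
    return addr & ~(region_size - 1)


def same_region(a: int, b: int, region_size: int) -> bool:
    """True if both addresses fall in the same coarse-grain region."""
    return region_base(a, region_size) == region_base(b, region_size)
```

With 512-byte regions, `same_region(0x1200, 0x13FF, 512)` holds while `same_region(0x1200, 0x1400, 512)` does not, so a single region entry covers eight 64-byte lines under these assumptions.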


Coarse-Grain Coherence Tracking
• Each entry has an address tag, state, and
  count of lines cached by the processor

• The state indicates if the processor and / or
  other processors are sharing / modifying lines
  in the region

• On cache misses, the region state is read to
  determine if a broadcast is necessary
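
The lookup on a cache miss can be sketched as follows. This is a simplified model, not the protocol from the talk: the region state is reduced to what other processors may hold (`Others`), and the names `RegionEntry` and `needs_broadcast` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Others(Enum):
    """Simplified view of what other processors hold in a region (assumption)."""
    NONE = 0   # no other processor caches lines of the region
    CLEAN = 1  # others may hold unmodified copies
    DIRTY = 2  # others may hold modified copies


@dataclass
class RegionEntry:
    tag: int             # region address tag
    others: Others       # simplified region state
    line_count: int = 0  # lines of this region cached by this processor


def needs_broadcast(rca: dict, region_tag: int, is_write: bool) -> bool:
    """Decide on a cache miss whether the request must be broadcast."""
    entry = rca.get(region_tag)
    if entry is None:
        return True   # RCA miss: nothing is known about the region
    if entry.others is Others.NONE:
        return False  # non-shared region: go directly to memory
    if entry.others is Others.CLEAN and not is_write:
        return False  # memory holds a valid copy, so a read can go direct
    return True       # other copies may need to be invalidated or supplied
```

The design choice illustrated here is that the broadcast decision needs only the region entry, never the states of individual remote lines.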

Coarse-Grain Coherence Tracking

 • On snoops, the region state provides a
   response for the region
     – Piggy-backed onto the conventional response
     – Used to update other processors’ region state


 • RCA maintains inclusion over caches
     – When regions are evicted, their lines are evicted
     – RCA must respond correctly if region’s lines cached
     – Replacement algorithm uses line count
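
A sketch of how the line count could drive replacement while preserving inclusion. Preferring regions with no lines cached follows the talk; falling back to the oldest entry is an assumption, and `choose_victim` and `evict_region` are hypothetical names.

```python
def choose_victim(rca_set):
    """Pick the region to replace in one RCA set.

    rca_set is a list of (region_tag, line_count) pairs, oldest first.
    """
    # Regions with no lines cached are replaced first: evicting them
    # forces no cache lines out.
    for tag, count in rca_set:
        if count == 0:
            return tag
    # Assumed fallback: replace the oldest entry.
    return rca_set[0][0]


def evict_region(victim_tag, cache_lines, region_of):
    """Maintain inclusion: evict every cached line that belongs to the victim.

    cache_lines is a set of line addresses; region_of maps a line to its region.
    Returns the sorted list of evicted lines.
    """
    evicted = sorted(line for line in cache_lines if region_of(line) == victim_tag)
    for line in evicted:
        cache_lines.discard(line)
    return evicted
```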
     Example: Conventional Snooping

[Figure: animated diagram. Processors P0 and P1 with caches $0 and $1 (columns: Tag, State) and memories M0 and M1, connected by the network. All cache entries start Invalid.]

• P0 loads 10000₂
    – MISS; line 0010 in $0 goes from Invalid to Pending
• Snoop performed
    – "Read: P0, 10000₂" is broadcast on the network
• Response sent
    – Response: Invalid
• Data transfer
    – Line 0010 in $0 becomes Exclusive
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

[Figure: animated diagram. The same system with a Region Coherence Array (RCA) next to each cache. All cache and RCA entries start Invalid.]

• P0 loads 10000₂
    – MISS; cache line 0010 and RCA entry 001 go from Invalid to Pending
• Snoop performed
    – "Read: P0, 10000₂" is broadcast on the network
• Response sent
    – Response: Invalid, Region Not Shared
    – P0 has exclusive access to the region; RCA entry 001 becomes DI
• Data transfer
    – Cache line 0010 becomes Exclusive
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

[Figure: animated diagram. $0 holds line 0010 in state Exclusive and P0's RCA holds region 001 in state DI; P1's cache and RCA entries are Invalid.]

• P0 loads 11000₂
    – MISS, Region Hit; cache line 0011 goes from Invalid to Pending
    – Exclusive region state, broadcast unnecessary
• Direct request sent
    – "Read: P0, 11000₂" goes directly to memory
• Data transfer
    – Cache line 0011 becomes Exclusive
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

[Figure: animated diagram. $0 holds lines 0010 and 0011 in state Exclusive and P0's RCA holds region 001 in state DI; P1's cache and RCA entries start Invalid.]

• P1 stores 10000₂
    – MISS; cache line 0010 and RCA entry 001 at P1 go from Invalid to Pending
• Snoop performed
    – "RFO: P1, 10000₂" is broadcast on the network
    – Hits in P0 cache
• Response sent
    – Response: Owned, Region Owned
    – Region not exclusive anymore; RCA entry 001 becomes DD at both P0 and P1
• Data transfer
    – Line 0010 becomes Invalid in $0 and Modified in $1
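
The three accesses in the example can be replayed with a small model. This is a sketch under stated assumptions, not the protocol from the talk: the region state is reduced to a single "exclusive" flag, and the 8-byte line and 16-byte region sizes are inferred from the tags shown in the example (10000₂ maps to cache tag 0010 and region tag 001). `Proc` and `miss` are hypothetical names.

```python
LINE_BITS = 3    # inferred: 10000 (binary) >> 3 gives cache tag 0010
REGION_BITS = 4  # inferred: two lines per region, region tag 001


class Proc:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # line tag -> line state
        self.rca = {}    # region tag -> True if held exclusively (simplified)


def miss(procs, who, addr, store=False):
    """Handle a cache miss by procs[who]; return 'direct' or 'broadcast'."""
    me = procs[who]
    line, region = addr >> LINE_BITS, addr >> REGION_BITS
    if me.rca.get(region):
        # Region hit with exclusive state: no broadcast is needed.
        me.cache[line] = "Modified" if store else "Exclusive"
        return "direct"
    region_shared = line_shared = False
    for other in procs:
        if other is me:
            continue
        if region in other.rca:
            # The snoop carries a region response; the region is now shared.
            region_shared = True
            other.rca[region] = False
        if line in other.cache:
            if store:
                del other.cache[line]  # a read-for-ownership invalidates copies
            else:
                other.cache[line] = "Shared"
                line_shared = True
    me.rca[region] = not region_shared
    if store:
        me.cache[line] = "Modified"
    else:
        me.cache[line] = "Shared" if line_shared else "Exclusive"
    return "broadcast"


p0, p1 = Proc("P0"), Proc("P1")
procs = [p0, p1]
steps = [
    miss(procs, 0, 0b10000),              # P0 loads 10000 (binary)
    miss(procs, 0, 0b11000),              # P0 loads 11000 (binary)
    miss(procs, 1, 0b10000, store=True),  # P1 stores 10000 (binary)
]
print(steps)  # ['broadcast', 'direct', 'broadcast']
```

Only the second access avoids the broadcast, because it falls in a region that P0 already holds exclusively; after the third access neither processor holds the region exclusively.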
                   Overhead
• Storage space needed for RCA
    – 3-6% storage overhead for cache


• Two bits needed in snoop response for region
  response

• Path to memory needed to avoid broadcasts
    – Simple with on-chip memory controllers
    – May leverage data network
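
A back-of-the-envelope sketch of the RCA storage cost relative to the cache it shadows. Every parameter value in the usage line is an assumption chosen for illustration (1 MB cache, 64-byte lines, 2-way, 40-bit physical addresses); only the 3 state bits (enough for a 7-state region protocol) and the "same sets and associativity as L2" organization come from the slides. `rca_overhead` is a hypothetical helper and ignores ECC, parity, and replacement bits.

```python
import math


def rca_overhead(cache_bytes, line_bytes, region_bytes, assoc, addr_bits, state_bits=3):
    """Estimate RCA storage as a fraction of the cache's data capacity."""
    entries = cache_bytes // line_bytes  # same sets and associativity as the cache
    sets = entries // assoc
    # Tag = address bits left after the region offset and the set index.
    tag_bits = addr_bits - int(math.log2(region_bytes)) - int(math.log2(sets))
    lines_per_region = region_bytes // line_bytes
    # The line count must represent 0..lines_per_region inclusive.
    count_bits = math.ceil(math.log2(lines_per_region + 1))
    entry_bits = tag_bits + state_bits + count_bits
    return entries * entry_bits / (cache_bytes * 8)


overhead = rca_overhead(cache_bytes=1 << 20, line_bytes=64, region_bytes=512,
                        assoc=2, addr_bits=40)
print(f"{overhead:.1%}")  # 4.9% for these assumed parameters
```

Larger regions shorten the tag but widen the line count, so the estimate changes only modestly with region size under these assumptions.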

                     Simulator
PHARMsim:

• Execution-driven simulator built on top of SimOS-PPC

• Four 4-way superscalar out-of-order processors

• Two-level hierarchy with split L1, unified L2 caches

• Separate address / data networks, similar to Fireplane

• Region Coherence Array with same sets/assoc. as L2


                 Workloads
• Scientific
    – Ocean, Raytrace, Barnes

• Multiprogrammed
    – SPECint2000_rate


• Commercial
    – TPC-W, TPC-B, TPC-H, SPECweb99,
      SPECjbb2000

                             Broadcasts Avoided
[Figure: stacked bar chart of the percentage of requests (0% to 100%) for which a broadcast is avoided, broken down by request type: Read, I-Fetch, Write, DCB, Write-back. Each workload group (Scientific, Multiprogrammed, Commercial, Arithmetic Mean) has bars for Oracle, 256B, 512B, and 1KB regions. Arithmetic Mean labels: 67% (Oracle), 56% (512B).]
  Snoop Traffic Reduction – Peak
[Figure: bar chart of peak broadcasts per 100K cycles (0 to 8000), comparing Peak Traffic with Peak Traffic with 512B Regions. Labels: Scientific 64%, Multiprogrammed 38%, Commercial 51%.]
Snoop Traffic Reduction – Average
[Figure: bar chart of average broadcasts per 100K cycles (0 to 4000), comparing Average Traffic with Average Traffic with 512B Regions. Labels: Scientific 74%, Multiprogrammed 86%, Commercial 47%.]
                                          Execution Time
[Figure: bar chart of normalized execution time (0.0 to 1.0) for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean, with bars for Baseline, 256B Regions, 512B Regions, and 1KB Regions. Arithmetic Mean label: 91.2%.]
         Remaining Opportunity
• With 512B regions, ~10% of requests are
  broadcast unnecessarily

• A third of the 10% are region false sharing

• Half of the 10% miss in RCA
   – Potential for prefetching

                         Inclusion Overhead
[Figure: bar chart of L2 miss ratio (0% to 100%), comparing the baseline miss rate with the miss rate with 512B regions. Labels: Scientific +0.23%, Multiprogrammed +0.04%, Commercial +0.56%.]

• Regions with no lines cached are replaced first
                  Conclusion
Coarse-Grain Coherence Tracking:

• Reduces broadcast traffic
    – Most data requests sent directly to memory

• Reduces latency
    – Many requests not sent to central arbitration point
    – Many non-data requests not sent externally

• Improves scalability and performance
               The End




                                    Inclusion Evictions
[Figure: stacked bar chart of region evictions (0% to 100%) for Scientific, Multiprogrammed, and Commercial workloads, broken down into 0 lines evicted, 1 line evicted, and 2 lines evicted.]
                        Ordering
• Ordering point is now the Region Coherence Array
    – A direct request is ordered once it accesses the RCA


• Direct requests are serialized w.r.t. snoop requests
    – A direct request occurs either before, or after a snoop
    – All must appear to access and update RCA atomically

• No two processors can have exclusive access to a
  region at the same time (no races)
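
The serialization requirement can be illustrated in software with a lock standing in for the hardware arbitration at one processor's RCA. This is only an analogy under that assumption; the class and method names are hypothetical.

```python
import threading


class RegionCoherenceArray:
    """One processor's RCA; a lock makes each access and update atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._exclusive = set()  # regions this processor holds exclusively
        self.log = []            # order in which requests accessed the RCA

    def grant_exclusive(self, region):
        with self._lock:
            self._exclusive.add(region)

    def try_direct(self, region):
        """A local request is ordered at the moment it accesses the RCA."""
        with self._lock:
            ok = region in self._exclusive
            self.log.append(("direct" if ok else "must-broadcast", region))
            return ok

    def on_snoop(self, region):
        """An external request ends this processor's exclusive access."""
        with self._lock:
            self._exclusive.discard(region)
            self.log.append(("snoop", region))


rca = RegionCoherenceArray()
rca.grant_exclusive(1)
first = rca.try_direct(1)   # ordered before the snoop: may go direct
rca.on_snoop(1)
second = rca.try_direct(1)  # ordered after the snoop: must broadcast
```

Because both paths take the same lock, a direct request observes the region state either entirely before or entirely after any snoop, which is the property the slide asks for.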



     Comparison to RegionScout
                                        CGCT                       RegionScout
Optimization                            Latency                    Power
Avoids broadcast for non-shared data    Yes                        Yes
Avoids broadcast for clean data         Yes                        No
Avoids tag lookups on snoops            No                         Yes (like Jetty)
Region state storage                    Inclusive cache            Hash table, small cache
Region state transfer                   2 bits in snoop response   1 bit in snoop response
Region protocol                         7 states                   Effectively 4 states
                                          Execution Time
[Figure: bar chart of normalized execution time (0.0 to 1.0) for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean, comparing 512B Regions with 512B Regions, half number of sets. Arithmetic Mean labels: 91.2% (512B Regions), 92.2% (512B Regions, half number of sets).]

				