
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture
June 7th, 2005

               Overview of Idea
Coarse-Grain Coherence Tracking:

1. Monitors coherence status of memory at a
   multi-line granularity

2. Uses the coarse-grain information to identify
   requests that don’t need a coherence
   broadcast

3. Sends these requests directly to memory
June 7, 2005         ISCA 2005                 2
                    Problem
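[Figure: four processor nodes, each with a processor (P) and DRAM, connected by a broadcast network and a data network; one node is expanded to show its NC, MC, and cache ($).]
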

Snoop-based systems support a limited
number of processors
    – Limited broadcast bandwidth
    – Increasing memory latency
                 Opportunity
• Some data requests don’t need a broadcast
    – Requests for non-shared data
    – Fetches of unmodified instructions
    – Write-backs

• Some non-data requests don’t need to leave the
  processor
    – Requests to upgrade a copy of a line that is not shared
    – Requests to flush copies of a line that is not cached elsewhere
                  Unnecessary Broadcasts
[Figure: stacked bar chart. Y-axis: Requests (0% to 100%). Bars: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Legend: Read, I-Fetch, Write, DCB, Write-back. Bar labels, left to right: 62%, 93%, 65%, 67%.]
               Our Approach
• Identify requests that don’t need a broadcast

• Send data requests directly to memory
    – Reduce broadcast traffic
    – Reduce latency in some systems

• Avoid sending non-data requests externally
    – Further reduce broadcast traffic
    – Reduce latency

Coarse-Grain Coherence Tracking

• Memory is divided into coarse-grain regions
    – Aligned, power-of-two multiple of cache line size
    – Can range from two lines to a physical page


• A cache-like structure is added to each
  processor for monitoring coherence at the
  granularity of regions
    – Region Coherence Array (RCA)
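
The mapping from an address to its region can be sketched as follows. This is a minimal illustration, not code from the talk: the 64-byte line size, the 512-byte region size, and all function names are assumptions chosen for the example.

```python
LINE_SIZE = 64     # bytes per cache line (assumed for illustration)
REGION_SIZE = 512  # bytes per region (assumed; other power-of-two sizes work too)

LINES_PER_REGION = REGION_SIZE // LINE_SIZE

# A region is an aligned, power-of-two multiple of the cache line size.
assert REGION_SIZE % LINE_SIZE == 0
assert LINES_PER_REGION >= 2
assert (LINES_PER_REGION & (LINES_PER_REGION - 1)) == 0


def region_number(addr: int) -> int:
    """Index of the region containing a physical byte address."""
    return addr // REGION_SIZE


def region_base(addr: int) -> int:
    """First byte address of the region containing addr."""
    return addr & ~(REGION_SIZE - 1)


def line_in_region(addr: int) -> int:
    """Position (0-based) of addr's cache line within its region."""
    return (addr % REGION_SIZE) // LINE_SIZE
```

Because regions are aligned powers of two, the region number is simply the upper bits of the address, so a hardware lookup needs no arithmetic beyond bit selection.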


Coarse-Grain Coherence Tracking
• Each entry has an address tag, state, and
  count of lines cached by the processor

• The state indicates if the processor and / or
  other processors are sharing / modifying lines
  in the region

• On cache misses, the region state is read to
  determine if a broadcast is necessary
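
A minimal sketch of an RCA entry and the check made on a cache miss is given below. It is deliberately simplified: the three states stand in for the full region protocol, the dictionary stands in for a set-associative array, and every name (`RegionState`, `RCAEntry`, `broadcast_needed`) and the 512-byte region size are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional

REGION_SIZE = 512  # bytes per region (assumed)


class RegionState(Enum):
    # Simplified, hypothetical state set; the real protocol has more states.
    INVALID = "invalid"      # nothing is known about the region
    EXCLUSIVE = "exclusive"  # no other processor caches lines of the region
    SHARED = "shared"        # other processors may cache lines of the region


@dataclass
class RCAEntry:
    tag: int              # region address tag
    state: RegionState    # coarse-grain coherence state
    line_count: int = 0   # lines of this region cached by this processor


class RegionCoherenceArray:
    def __init__(self) -> None:
        # A dict keyed by region number stands in for the hardware array.
        self.entries: Dict[int, RCAEntry] = {}

    def lookup(self, addr: int) -> Optional[RCAEntry]:
        return self.entries.get(addr // REGION_SIZE)

    def broadcast_needed(self, addr: int) -> bool:
        """On a cache miss: broadcast unless the region is known exclusive."""
        entry = self.lookup(addr)
        return entry is None or entry.state is not RegionState.EXCLUSIVE
```

The sketch treats every non-exclusive region as requiring a broadcast, which is conservative; a fuller model would also distinguish request types, since some requests to shared regions need no broadcast either.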

Coarse-Grain Coherence Tracking

 • On snoops, the region state provides a
   response for the region
     – Piggy-backed onto the conventional response
     – Used to update other processors’ region state


 • RCA maintains inclusion over caches
     – When regions are evicted, their lines are evicted
     – RCA must respond correctly whenever a region’s lines are cached
     – Replacement algorithm uses line count
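
The inclusion rule and a line-count-based replacement choice can be sketched as below. The policy shown (evict the entry with the fewest cached lines, so that empty regions go first) is one plausible reading of "uses line count", not a statement of the exact algorithm; `RegionEntry`, `choose_victim`, and `evict_region` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RegionEntry:
    tag: int
    # Addresses of this region's lines currently held in the cache.
    cached_lines: Set[int] = field(default_factory=set)

    @property
    def line_count(self) -> int:
        return len(self.cached_lines)


def choose_victim(rca_set: List[RegionEntry]) -> RegionEntry:
    """Pick the entry with the fewest cached lines; empty regions go first."""
    return min(rca_set, key=lambda entry: entry.line_count)


def evict_region(rca_set: List[RegionEntry], cache_lines: Set[int],
                 victim: RegionEntry) -> Set[int]:
    """Remove a region and, to preserve inclusion, every line it covers."""
    evicted = set(victim.cached_lines)
    cache_lines.difference_update(evicted)  # lines leave the cache with the region
    rca_set.remove(victim)
    return evicted
```

Preferring regions with no cached lines means most region evictions cost nothing in the cache; only when every candidate still has lines cached does inclusion force line evictions.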
     Example: Conventional Snooping

• P0 loads 10000₂
    MISS
• Snoop performed
• Response sent
• Data transfer

[Animated figure: processors P0 and P1, caches $0 and $1 (columns: Tag, State), and memories M0 and M1 on a shared network. P0 broadcasts "Read: P0, 10000₂"; the snoop response is Invalid; the line with tag 0010 in $0 goes from Invalid to Pending to Exclusive as the data arrives.]
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

• P0 loads 10000₂
    MISS
• Snoop performed
• Response sent
• Data transfer

[Animated figure: processors P0 and P1, each with a cache ($0, $1) and an RCA, and memories M0 and M1 on a shared network. P0 broadcasts "Read: P0, 10000₂"; the response is "Invalid, Region Not Shared", so P0 has exclusive access to the region. In $0 the line with tag 0010 goes from Invalid to Pending to Exclusive; in P0's RCA the entry with region tag 001 goes from Invalid to Pending to DI.]
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

• P0 loads 11000₂
    MISS, Region Hit
• Direct request sent
• Data transfer

[Animated figure: processors P0 and P1, each with a cache ($0, $1) and an RCA, and memories M0 and M1 on a shared network. P0's RCA entry for region 001 is in state DI (an exclusive region state), so a broadcast is unnecessary. "Read: P0, 11000₂" is sent directly to memory, and the line with tag 0011 in $0 goes from Invalid to Pending to Exclusive.]
  Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region

• P1 stores 10000₂
    MISS
• Snoop performed
    Hits in P0 cache
• Response sent
• Data transfer

[Animated figure: processors P0 and P1, each with a cache ($0, $1) and an RCA, and memories M0 and M1 on a shared network. P1 broadcasts "RFO: P1, 10000₂"; the response is "Owned, Region Owned", so the region is not exclusive anymore. In $0 the line with tag 0010 goes from Exclusive to Pending to Invalid, and P0's RCA entry 001 goes from DI to DD. In $1 the line with tag 0010 goes from Invalid to Pending to Modified, and P1's RCA entry 001 goes from Invalid to Pending to DD.]
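
The three Region Coherence Array examples (P0 loads binary address 10000, P0 loads 11000, P1 stores 10000) can be traced with a toy model. This is only a sketch under stated assumptions: it tracks regions rather than individual lines, treats loads and stores alike, and uses an 8-byte line size inferred from the tags on the slides. The `Node` class and `request` function are hypothetical.

```python
LINE_BYTES = 8     # assumed, so that address 0b10000 has line tag 0b0010
REGION_BYTES = 16  # two lines per region, as in the example


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cached = set()     # regions from which this node caches lines
        self.exclusive = set()  # regions this node believes no one else caches


def request(requester: Node, others: list, addr: int) -> str:
    """Return "direct" or "broadcast" for a miss at addr."""
    region = addr // REGION_BYTES
    if region in requester.exclusive:
        requester.cached.add(region)
        return "direct"  # region hit: memory is accessed without a snoop
    shared = False
    for other in others:  # the broadcast snoops every other node
        if region in other.cached:
            shared = True
            other.exclusive.discard(region)  # region is not exclusive anymore
    requester.cached.add(region)
    if not shared:
        requester.exclusive.add(region)
    return "broadcast"


p0, p1 = Node("P0"), Node("P1")
trace = [
    request(p0, [p1], 0b10000),  # P0 loads 10000: first touch of the region
    request(p0, [p1], 0b11000),  # P0 loads 11000: same region, held exclusively
    request(p1, [p0], 0b10000),  # P1 stores 10000: P0 caches lines of the region
]
print(trace)  # prints ['broadcast', 'direct', 'broadcast']
```

Only the second request avoids the broadcast, because it is the only one issued while the requester holds the region exclusively.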
                   Overhead
• Storage space needed for RCA
    – 3-6% storage overhead for cache


• Two bits needed in snoop response for region
  response

• Path to memory needed to avoid broadcasts
    – Simple with on-chip memory controllers
    – May leverage data network
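
A rough way to estimate the RCA storage cost is to compare the bits in one RCA entry with the bits in one cache line, assuming the RCA has as many entries as the cache has lines. Every parameter below (address width, line and region sizes, cache geometry, state bits) is an assumption for illustration and is not taken from the slides.

```python
import math

# All parameters are assumed values chosen for illustration.
PHYS_ADDR_BITS = 40
LINE_BYTES = 64
REGION_BYTES = 512
CACHE_SETS = 8192
LINE_STATE_BITS = 3
REGION_STATES = 7

lines_per_region = REGION_BYTES // LINE_BYTES
index_bits = int(math.log2(CACHE_SETS))

# Per cache line: data + address tag + line state.
line_tag_bits = PHYS_ADDR_BITS - int(math.log2(LINE_BYTES)) - index_bits
cache_bits_per_entry = LINE_BYTES * 8 + line_tag_bits + LINE_STATE_BITS

# Per RCA entry: region tag + region state + count of cached lines
# (the count ranges from 0 to lines_per_region inclusive).
region_tag_bits = PHYS_ADDR_BITS - int(math.log2(REGION_BYTES)) - index_bits
region_state_bits = math.ceil(math.log2(REGION_STATES))
count_bits = math.ceil(math.log2(lines_per_region + 1))
rca_bits_per_entry = region_tag_bits + region_state_bits + count_bits

# Same number of entries in both arrays, so the ratio of entry sizes suffices.
overhead = rca_bits_per_entry / cache_bits_per_entry
print(f"RCA storage overhead: {overhead:.1%}")
```

Larger regions shorten the region tag but lengthen the line count, so the estimate changes only slowly with region size.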

                     Simulator
PHARMsim:

• Execution-driven simulator built on top of SimOS-PPC

• Four 4-way superscalar out-of-order processors

• Two-level hierarchy with split L1, unified L2 caches

• Separate address / data networks – similar to Fireplane

• Region Coherence Array with same sets/assoc. as L2


                 Workloads
• Scientific
    – Ocean, Raytrace, Barnes

• Multiprogrammed
    – SPECint2000_rate


• Commercial
    – TPC-W, TPC-B, TPC-H, SPECweb99,
      SPECjbb2000

                             Broadcasts Avoided
[Figure: stacked bar chart. Y-axis: Requests (0% to 100%). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean, each with bars for Oracle, 256B, 512B, and 1KB regions. Legend: Read, I-Fetch, Write, DCB, Write-back. Labels in the Arithmetic Mean group: 67%, 56%.]
  Snoop Traffic Reduction – Peak
[Figure: bar chart. Y-axis: Broadcasts / 100K Cycles (0 to 8000). Groups: Scientific, Multiprogrammed, Commercial. Series: Peak Traffic, Peak Traffic with 512B Regions. Labels, left to right: 64%, 38%, 51%.]
Snoop Traffic Reduction – Average
[Figure: bar chart. Y-axis: Broadcasts / 100K Cycles (0 to 4000). Groups: Scientific, Multiprogrammed, Commercial. Series: Average Traffic, Average Traffic with 512B Regions. Labels, left to right: 74%, 86%, 47%.]
                                          Execution Time
[Figure: bar chart. Y-axis: Normalized Execution Time (0.0 to 1.0). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Series: Baseline, 256B Regions, 512B Regions, 1KB Regions. Label on the Arithmetic Mean group: 91.2%.]
         Remaining Opportunity
• With 512B regions, ~10% of requests are
  broadcast unnecessarily

• A third of the 10% are due to region false sharing

• Half of the 10% miss in the RCA
   – Potential for prefetching
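
As a quick check of these proportions, expressed as fractions of all requests (the final term is simply what the two stated causes leave unaccounted for):

```latex
\tfrac{1}{3}\times 10\% \approx 3.3\%, \qquad
\tfrac{1}{2}\times 10\% = 5\%, \qquad
10\% - 3.3\% - 5\% \approx 1.7\%
```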

                         Inclusion Overhead
[Figure: bar chart. Y-axis: L2 Miss Ratio (0% to 100%). Groups: Scientific, Multiprogrammed, Commercial. Series: Baseline miss rate, 512B miss rate. Labels, left to right: +0.23%, +0.04%, +0.56%.]

    – Regions with no lines cached are replaced first
                  Conclusion
Coarse-Grain Coherence Tracking:

• Reduces broadcast traffic
    – Most data requests sent directly to memory

• Reduces latency
    – Many requests not sent to central arbitration point
    – Many non-data requests not sent externally

• Improves scalability and performance
               The End




                                    Inclusion Evictions
[Figure: stacked bar chart. Y-axis: Region Evictions (0% to 100%). Bars: Scientific, Multiprogrammed, Commercial. Legend: 0 lines evicted, 1 line evicted, 2 lines evicted.]
                        Ordering
• Ordering point is now the Region Coherence Array
    – A direct request is ordered once it accesses the RCA


• Direct requests are serialized with respect to snoop requests
    – A direct request occurs either before or after a snoop
    – All must appear to access and update RCA atomically

• No two processors can have exclusive access to a
  region at the same time (no races)
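
The ordering rule can be illustrated with a software analogy in which a lock stands in for atomic access to the RCA. This is a hypothetical sketch, not the hardware mechanism: `SerializedRCA` and its methods are invented names, and a single set of exclusively held regions replaces the full region state.

```python
import threading


class SerializedRCA:
    def __init__(self) -> None:
        self._lock = threading.Lock()  # stands in for atomic RCA access
        self._exclusive = set()        # regions held exclusively by this processor

    def grant_exclusive(self, region: int) -> None:
        with self._lock:
            self._exclusive.add(region)

    def try_direct_request(self, region: int) -> bool:
        """A direct request is ordered at the moment it reads the RCA."""
        with self._lock:
            return region in self._exclusive

    def snoop(self, region: int) -> None:
        """An external request for the region ends exclusive access."""
        with self._lock:
            self._exclusive.discard(region)


rca = SerializedRCA()
rca.grant_exclusive(1)
before = rca.try_direct_request(1)  # ordered before the snoop: may go direct
rca.snoop(1)
after = rca.try_direct_request(1)   # ordered after the snoop: must broadcast
```

Because both operations take the same lock, a direct request sees the region state either entirely before or entirely after a snoop updates it, never a mixture of the two.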



     Comparison to RegionScout

                                        CGCT                       RegionScout
Optimization                            Latency                    Power
Avoids broadcast for non-shared data    Yes                        Yes
Avoids broadcast for clean data         Yes                        No
Avoids tag lookups on snoops            No                         Yes – like Jetty
Region state storage                    Inclusive cache            Hash table, small cache
Region state transfer                   2 bits in snoop response   1 bit in snoop response
Region protocol                         7 states                   Effectively 4 states

                                          Execution Time
[Figure: bar chart. Y-axis: Normalized Execution Time (0.0 to 1.0). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Series: 512B Regions; 512B Regions, half number of sets. Labels on the Arithmetic Mean group: 91.2%, 92.2%.]

								