
       Signatures in Transactional Memory Systems
       Dissertation Defense

            Luke Yen
            1/29/2009
                  Key Contributions

Trend: Transactional memory (TM) is an emerging parallel programming
  paradigm. Programmers annotate transactions, which execute
  atomically (all or nothing).




Challenge #1: Hardware TM (HTM) systems may restrict
  transactions or incur overheads on common events (e.g., cache
  evictions).
Contribution: LogTM-SE HTM: Simple hardware that interacts with the
  operating system to virtualize transactions. No overhead on
  cache evictions.



              Key Contributions Cont.
Challenge #2: (1) H3 signatures have high area & power overheads &
(2) Thread-private references cause false conflicts.
Contribution: Notary: (1) Page-Block-XOR performs similarly to H3
   but with lower overheads (2) Stack & heap-based privatization.

Challenge #3: Difficult to understand HTM system performance.
Contribution: TMProf: Lightweight hardware performance counters
  help HTM designers & TM programmers.

Challenge #4: Signatures suffer from false conflicts.
Contribution: Six hardware/software signature extensions to
  mitigate false conflicts.


                      Outline
• Introduction and Background
  • Transactional Memory background
  • LogTM-SE [HPCA 2007]                      Contribution #1


• Notary [MICRO 2008]                         Contribution #2
                                               Focus of
                                               presentation
• TMProf (Submitted for publication)          Contribution #3


• Conclusion

          * Skip “Extensions to Signatures”   Contribution #4
         Transactional Memory (TM)
• Locks do not compose
  • Can lead to deadlocks

• TM programmer says
  • “I want this atomic”
• TM system
  • “Makes it so”

      void move(T s, T d, Obj key){
        atomic {
          tmp = s.remove(key);
          d.insert(key, tmp);
        }
      }

• Focus on Hardware TM (HTM) Implementations
  • Fast
  • Leverage cache coherence & speculation
  • But hardware finite & should be policy-free
  LogTM Signature Edition (LogTM-SE) at
              50,000 feet
• HTMs Fast
  • Version management – for transaction commits & aborts
     • HW handles old/new versions (e.g., write buffer)
  • Conflict detection – commit only non-conflicting transactions
     • HW handles conflict detection (R/W bits & coherence)
• But Closely Coupled to L1 cache
  • On critical paths & hard for SW to save/restore


• Our Approach: Decoupled, Simple HW, SW control
• LogTM-SE
  • HW: LogTM's Log + Signatures (from Illinois Bulk)
  • SW: Unbounded nesting, thread switching, & paging
              Signature Background
• Signatures used to summarize and detect conflicts
  with a transaction's read- and write-sets
   •   Inspired by Bulk system [Ceze, ISCA'06]
   •   Imprecise, can be implemented with Bloom filters
   •   Can have false positives, but never false negatives
   •   Also proposed for non-TM purposes (e.g., SC violation
       detection, atomicity violation detection, race recording)


• Ex: Use k Bloom filters of size m/k, with independent
  hash functions
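A minimal sketch of the example above: a signature split into k Bloom-filter banks of m/k bits, one hash function per bank. The class name `BankedSignature` and the salted software hash are illustrative assumptions; a hardware signature would use a hardware hash function instead.

```python
import hashlib


class BankedSignature:
    """Signature built from k Bloom-filter banks of m/k bits each (illustrative)."""

    def __init__(self, m_bits: int = 2048, k: int = 2):
        assert m_bits % k == 0, "m must divide evenly into k banks"
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is an int used as a bit vector

    def _index(self, bank: int, addr: int) -> int:
        # Stand-in hash (assumption): a salted software hash, one salt per bank.
        digest = hashlib.blake2b(addr.to_bytes(8, "little"),
                                 digest_size=8, salt=bytes([bank])).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, addr: int) -> None:
        # Set one bit in every bank.
        for b in range(self.k):
            self.banks[b] |= 1 << self._index(b, addr)

    def may_contain(self, addr: int) -> bool:
        # True may be a false positive; False is never a false negative.
        return all((self.banks[b] >> self._index(b, addr)) & 1
                   for b in range(self.k))
```

An inserted address always tests positive, which is the "never false negatives" property listed above.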




                      Outline

• Introduction and Background
• Notary
  •   Signature Background
  •   Entropy & Page-Block-XOR
  •   Privatization
  •   Methodology & Results
  •   Conclusions
• TMProf
• Conclusion


            Notary Executive Summary

Tackle 2 problems with hardware signatures:
• Problem 1: Best signature hashing (i.e., H3) has high area &
  power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing
  (Page-Block-XOR, PBX) that performs similarly to H3
   • Ex: 8x fewer gates - 160 gates for H3 vs 20 gates for PBX

• Problem 2: Spurious signature conflicts caused by signature
  bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, propose
  privatization interface for higher performance

               Signature hash functions
• Which hash function is best? [Sanchez, YEN, MICRO'07]
   • Bit-selection? Hash simply decodes some number of input bits
   • H3? Each bit of a hash value is an XOR of (on avg.) half of the input
     address bits


[Figure: LogTM-SE w/ 2kb signatures]




 • Result: H3 better with >=2 hash functions
 • However, H3 uses many multi-level XOR trees
     •Can we improve this?
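The two hash families can be sketched in software as follows. This is only an illustration: the random masks, the `seed` parameter, and the function names are assumptions, and real H3 hardware fixes its masks at design time.

```python
import random


def bit_select_hash(addr: int, c: int, shift: int = 0) -> int:
    """Bit-selection: use c address bits directly as the hash value."""
    return (addr >> shift) & ((1 << c) - 1)


def make_h3(c: int, addr_bits: int = 32, seed: int = 0):
    """Build an H3 hash with c output bits. Each output bit is the XOR
    (parity) of the address bits picked by a random mask, so on average
    half of the address bits feed each output bit."""
    rng = random.Random(seed)  # seed is an illustrative choice
    masks = [rng.getrandbits(addr_bits) for _ in range(c)]

    def h3(addr: int) -> int:
        value = 0
        for bit, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1
            value |= parity << bit
        return value

    return h3
```

Each H3 output bit needs an XOR tree over many address bits, which is where the gate cost comes from; bit-selection needs no XOR gates at all.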
            H3 implementation




• Num XOR = $\frac{\text{addr length in bits}}{4} \cdot c \cdot k$

• Ex: 2kb signatures, k=2, c=10, 32-bit addr =
  160 XOR gates per signature
• Can we reduce the total gate count?
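The gate-count arithmetic can be checked with a short sketch. Here c is read as the number of output bits per hash function and k as the number of hash functions, which matches the slide's example but is an interpretation. The PBX count of one XOR gate per output bit is an assumption inferred from the deck's 20-gate figure, not a stated formula.

```python
def h3_xor_gates(addr_bits: int, c: int, k: int) -> int:
    """XOR gates per signature, per the slide: (addr bits / 4) * c * k."""
    return (addr_bits // 4) * c * k


def pbx_xor_gates(c: int, k: int) -> int:
    """Assumed PBX cost: one 2-input XOR gate per hash output bit."""
    return c * k


h3_gates = h3_xor_gates(addr_bits=32, c=10, k=2)   # 2kb signature, k=2
pbx_gates = pbx_xor_gates(c=10, k=2)
print(h3_gates, pbx_gates, h3_gates // pbx_gates)
```

With these inputs the sketch gives 160 gates for H3 and 20 for PBX, the 8x difference quoted in the deck.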
                 Entropy defined
• Insight: Use most random bits for hashing
   • Use entropy to measure bit randomness
• Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
• p(xi) = the probability of the occurrence of value xi
• N = number of sample values random variable x can
  take on
• Entropy = amount of information required on average
  to describe outcome of variable x (in bits)
   • Ex: What is the best possible lossless compression?
• Entropy value of n-bit field ranges from 0 bits (min) to n bits (max)
   • 0 bits: n-bit field holds a constant value with probability 1
   • n bits: all bit patterns in n-bit field equally probable
   • Other cases fall in between
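A small sketch of the entropy measure, computed from the empirical distribution of observed values. The function name `entropy_bits` is illustrative.

```python
import math
from collections import Counter


def entropy_bits(samples) -> float:
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h + 0.0  # turns -0.0 into 0.0 for a constant field


constant_field = [5, 5, 5, 5]          # one value with probability 1
uniform_4bit_field = list(range(16))   # all 4-bit patterns equally probable
print(entropy_bits(constant_field), entropy_bits(uniform_4bit_field))
```

The two inputs are the two extremes: a constant field carries 0 bits and a uniformly distributed 4-bit field carries 4 bits.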
            Our measures of entropy
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
     • Global entropy – upper bound on entropy of
       address
• Q2: How does entropy change within an
  address?
     • Local entropy – entropy of bit-field within the
       address


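[Figure: global entropy is measured over the whole address (bits 31..6); local entropy is measured over a bit-field within the address, offset by NSkip bits]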
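The two measures might be computed over an address trace as sketched below. The assumptions are labeled in the code: 64-byte blocks (so the address starts at bit 6), NSkip counted as low-order bits skipped above the block offset, and a default window of 16 bits.

```python
import math
from collections import Counter


def entropy_bits(values) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values()) + 0.0


def global_entropy(addrs, block_bits: int = 6) -> float:
    """Entropy of whole block addresses (assumes 64-byte blocks)."""
    return entropy_bits(a >> block_bits for a in addrs)


def local_entropy(addrs, nskip: int, width: int = 16, block_bits: int = 6) -> float:
    """Entropy of a width-bit window that starts nskip bits above the
    block offset (assumed direction of NSkip)."""
    mask = (1 << width) - 1
    return entropy_bits((a >> (block_bits + nskip)) & mask for a in addrs)


# Synthetic trace: 256 consecutive 64-byte blocks.
trace = [i << 6 for i in range(256)]
```

For this synthetic trace all of the entropy sits in the low-order block-address bits, so the local entropy drops as NSkip grows.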


                     Entropy results
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
   • Smaller windows (<16b) may not reach global entropy value
   • Larger windows (>16b) hide some fine-grain info




              Page-Block-XOR (PBX)
• Motivated by 3 findings:
  • (1) Lower-order bits have most entropy
     • Follows from our entropy results
  • (2) XORing two bit-fields produces random hash values
     • From prior work on XOR hashing (e.g., data placement in caches,
       DRAM)
  • (3) Bit-field overlaps can lead to higher false positives
     • Correlation between the two bit-fields can reduce the range of
       hash values produced (worse for larger signatures)




             PBX implementation
• For 2kb signatures with 2 hash functions:
  • 20 XOR gates for PBX vs 160 XOR gates for H3!




• PPN and Cache-index fields not tied to system
  params:
  • Use entropy to find two non-overlapping bit-fields with
    high randomness
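A minimal sketch of one PBX-style hash: XOR two non-overlapping c-bit fields of the address. The field positions used here (`low_shift=6`, high field directly above the low field) are illustrative assumptions; the deck chooses the fields by entropy analysis.

```python
def pbx_hash(addr: int, c: int = 10, low_shift: int = 6, high_shift=None) -> int:
    """XOR a low-order c-bit field with a higher c-bit field of addr.

    Costs c two-input XORs per hash value. Field positions are
    hypothetical defaults, not values taken from the deck.
    """
    if high_shift is None:
        high_shift = low_shift + c  # place the high field right above the low one
    if high_shift < low_shift + c:
        raise ValueError("bit-fields overlap")
    mask = (1 << c) - 1
    low_field = (addr >> low_shift) & mask     # cache-index-like bits
    high_field = (addr >> high_shift) & mask   # page-number-like bits
    return low_field ^ high_field
```

Rejecting overlapping fields follows finding (3) on the previous slide: overlap correlates the two inputs and narrows the range of hash values.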
                 Summary thus far

• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost
  PBX
   • Ex: 160 gates for H3 vs 20 gates for PBX


• Problem 2: Spurious signature conflicts caused by
  signature bits set by private memory addrs
• Solution 2: To be described




                        Privatization
• Problem: False conflicts caused by thread-private
  addrs
  • Avoid conflicts if addrs not inserted in thread's signatures


Two privatization solutions:
  • (1) Remove private stack references from sigs.
     • Very little work for programmer/compiler
     • Benefits depend on fraction of stack addresses versus all
       transactional references
  • (2) Language-level interface (e.g., private_malloc(),
    shared_malloc())
     • Even higher performance boost
     • WARNING: Incorrectly marking shared objects as private can lead
       to program errors!
      Page-based implementation

• Each page is assigned a status, private or
  shared
  • Invariant: Page is shared if any object is shared
• If stack is private, library marks stack pages as
  private
• If using privatization heap functions, mark heap
  pages accordingly
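The page-status invariant can be sketched with a small bookkeeping table. `PageStatusTable`, the 4 kB page size, and the conservative default for unknown pages are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class PageStatusTable:
    """Tracks, per page, how many shared objects live on it (illustrative)."""

    def __init__(self):
        self._shared_objects = {}  # page number -> count of shared objects

    def add_object(self, addr: int, size: int, shared: bool) -> None:
        first = addr // PAGE_SIZE
        last = (addr + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self._shared_objects.setdefault(page, 0)
            if shared:
                self._shared_objects[page] += 1

    def is_shared(self, addr: int) -> bool:
        # Invariant: a page is shared if any object on it is shared.
        # Unknown pages are treated as shared (conservative assumption).
        return self._shared_objects.get(addr // PAGE_SIZE, 1) > 0
```

A page stays private only while every object placed on it is private; a single shared object flips the whole page to shared.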




                     OS support
• OS allocates different physical page frames for
  shared and private pages
  • Sets a per-frame bit in translation entry if shared
  • Reduce number of page frames used by packing
    objects with same status together


• Signatures insert memory addresses of
  transactional references to shared pages
  • Query page sharing bit in HW TLB & current
    transactional status
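The insertion filter described above might look like this in software. `TlbEntry` and the use of a Python set in place of a hardware signature are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    frame: int
    shared: bool  # per-frame bit set by the OS for shared pages


def should_insert(in_transaction: bool, entry: TlbEntry) -> bool:
    """Insert only transactional references to shared pages."""
    return in_transaction and entry.shared


def record_access(signature: set, in_transaction: bool,
                  entry: TlbEntry, paddr: int) -> None:
    if should_insert(in_transaction, entry):
        signature.add(paddr)  # a set stands in for the hardware signature
```

References to private pages never set signature bits, so they cannot cause false conflicts.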


                       Methodology
•   Full-system simulation (GEMS)
•   Transistor-level design for area & power of XOR gates
•   CACTI for Bloom filter bit array area & power
•   Linear scaling to 65nm or 90nm for area, original 400nm
    for power




• Single-chip CMP
    •   16 single-threaded, in-order cores
    •   32kB, 4-way private L1 I & D
    •   8MB, 8-way shared L2 cache
    •   MESI directory protocol
    •   Signatures from 64b-64kb (8B-8kB) & “perfect”
                    Workloads
• Micro-benchmarks
• SPLASH-2 apps
  • Barnes & Raytrace – exert most signature pressure
• Stanford STAMP apps
  • Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada,
    Intruder
• DNS server
  • BIND




                PBX vs H3 area & power

   • Area & power overheads (2kb, k=4):

Type of       Bloom filter   H3 hash    PBX hash   H3 sig.    PBX sig.   % savings
overhead      bit array                                                  for PBX sig.
Area (mm^2)   4.67e-3        1.35e-3    7.83e-5    6.02e-3    4.75e-3    21
Power (mW)    1.80e2         1.04e1     1.02       1.90e2     1.81e2     4.7




PBX vs H3 execution time




   PBX performs similarly to H3
       Privatization results summary
• Removing private stack references from
  signatures did not help
  • Most addr references not to stack
  • Most likely because running with SPARC ISA. Other
    ISAs (e.g., x86) likely have more benefits


• Privatization interface helps five workloads
  • Remainder either does not have private heap
    structures or does not have high transactional duty
    cycle


Privatization interface results




                        Can improve
                        execution time


                      Conclusions
• Tackle 2 problems with signature designs:
   • (1) Area and power overheads of H3 hashing
       • E.g., 160 XOR gates for H3, 20 for PBX
   • (2) False conflicts due to signature bits set by private
     memory references
• Our solutions:
   • (1) Use entropy analysis to guide hashing function (PBX), a
     low-cost alternative that performs similarly to H3
   • (2) Prevent private stack references from entering
     signatures, and propose a privatization interface for heap
     allocations


• Notary can be applied to non-TM uses:
   • PBX hashing can directly transfer
   • Privatization may transfer if addr filtering applies
                         Outline

• Introduction and Background
• Notary
• TMProf
  •   Motivation
  •   Background
  •   TMProf
  •   Two Case Studies
  •   Future Directions for TMProf
  •   Conclusions
• Conclusion
        TMProf Executive Summary
• TM exposes more parallelism than lock-based programs
  • Complex thread interactions
• How can HTM designer understand HTM
  performance?
• How can TM programmer understand TM
  program performance?

• TMProf: Per-processor hardware performance
  counters to count cumulative event frequencies &
  overheads in HTM system

         Critical-section Parallelism
• TM enables critical-section parallelism – more
  thread interleavings
     With Locks                        With TM
 Thread 0      Thread 1          Thread 0      Thread 1
 Lock A        Lock A            xact_begin    xact_begin




  Hard to Predict Program Performance
• TM programmers may not have mastered
  intricacies of HTM system
• Programs may run faster on specific HTMs
• Example:




           Profiling with TMProf
• Allows HTM designers & TM programmers to
  understand HTM performance
• With TMProf:




         Background on Conflicts
• Three types: RW, WR, and WW
• Analogous to WAR, RAW, and WAW dependencies in uniprocessors

             Thread 0      Thread 1
   RW        xact_begin
             …             xact_begin
             LD A          …
             …             ST A
                           …
   WR        xact_begin
             …             xact_begin
             ST B          …
             …             LD B
                           …
   WW        xact_begin
             …             xact_begin
             ST C          …
             …             ST C
                           …
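The three conflict types follow from the order of the two accesses to the same address, as in this small sketch. The function name and the `"LD"`/`"ST"` encoding are illustrative.

```python
def conflict_type(first_op: str, second_op: str):
    """Classify a conflict between two transactions on one address.

    first_op is the earlier access ("LD" or "ST"), second_op the later one.
    Returns "RW", "WR", "WW", or None when both accesses are loads.
    """
    table = {
        ("LD", "ST"): "RW",  # read then write, analogous to WAR
        ("ST", "LD"): "WR",  # write then read, analogous to RAW
        ("ST", "ST"): "WW",  # write then write, analogous to WAW
    }
    return table.get((first_op, second_op))
```

Two loads never conflict, so that pair maps to `None`.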
       Conflict Detection & Resolution
• Conflicts detected eagerly or lazily
   • Eagerly – when requests occur
   • Lazily – at transaction commit


• Conflict resolution
   • Stall or abort on conflict
   • Choose set of procs to take action




                    TMProf
• Per-processor HW counters measuring cumulative
  event frequencies and cumulative event overheads

• Two implementations: Base & Extended

• Base (BaseTMProf): Breaks down HTM execution
  cycles into common components

• Extended (ExtTMProf): Builds on BaseTMProf & adds
  HTM-specific transaction-level profiling



            BaseTMProf & ExtTMProf
• BaseTMProf:
    • Total cycles = stalls + aborts + wasted_trans +
      useful_trans + committing + nontrans + implementation
      specific
    • Assume in-order procs, but can extend for out-of-order
      procs


• ExtTMProf: BaseTMProf profiling plus
    • Size of aborted transactions
    • Amount of transactional work after write-set prediction
    • HTMs may add more detailed profiling in future
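The BaseTMProf cycle breakdown can be modeled as a set of cumulative per-processor counters. This is a software sketch only; the class name and methods are assumptions, and the field names mirror the components in the total-cycles equation above.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseTMProfCounters:
    """One cumulative cycle counter per execution component."""
    stalls: int = 0
    aborts: int = 0
    wasted_trans: int = 0
    useful_trans: int = 0
    committing: int = 0
    nontrans: int = 0
    implementation_specific: int = 0

    def add(self, component: str, cycles: int = 1) -> None:
        setattr(self, component, getattr(self, component) + cycles)

    def total_cycles(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self))

    def breakdown(self) -> dict:
        """Fraction of total cycles spent in each component."""
        total = self.total_cycles()
        if total == 0:
            return {}
        return {f.name: getattr(self, f.name) / total for f in fields(self)}
```

Because each cycle is attributed to exactly one component, the counters sum to the total execution cycles.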

                     Two Case Studies
• TMProf profiling two HTMs:
   • LogTM-SE (eager conflict detection & version management,
     EE)
   • Approximation of Stanford's TCC (lazy conflict detection &
     version management, LL)

• Examine key parameters of eager & lazy conflict detection
   • Idealize version management

• Same system parameters as Notary
   • 16-processor CMP w/ in-order, single-issue processor cores
   • Perfect signatures
   • Same workloads
     EE: Different Conflict Resolutions
• Three different conflict resolutions:
   • Base, Timestamp, Hybrid
   • All use timestamps


• Base: Requestor stalls until possible deadlock

• Timestamp: Older requestors always abort
  younger transactions. Younger requestors stalled
  by older transactions.

• Hybrid: Base, except RW from older writer aborts
  younger reader
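The three policies can be compared with a small decision function. This is a sketch under stated assumptions: a smaller timestamp means an older transaction, and on a possible deadlock the requestor is the one that aborts. The slide does not say which transaction aborts in that case.

```python
def resolve(policy: str, requestor_ts: int, holder_ts: int,
            conflict: str = "WW", possible_deadlock: bool = False) -> str:
    """Return "stall_requestor", "abort_holder", or "abort_requestor"."""
    requestor_older = requestor_ts < holder_ts  # smaller timestamp = older

    if policy == "timestamp":
        # Older requestors abort younger transactions; younger requestors stall.
        return "abort_holder" if requestor_older else "stall_requestor"

    if policy == "hybrid" and conflict == "RW" and requestor_older:
        # RW: the holder read first and the requestor now writes.
        return "abort_holder"

    if policy in ("base", "hybrid"):
        # Assumption: the requestor aborts once a deadlock is possible.
        return "abort_requestor" if possible_deadlock else "stall_requestor"

    raise ValueError(f"unknown policy: {policy}")
```

Hybrid differs from Base only in the RW case with an older writer, which is the case the later results single out.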
            EE: Write-set Prediction
• Avoid aborts from a thread's load-then-store pattern
  • Predict & serialize on these conflicts
[Figure: two timelines for threads T0, T1, T2. Left: T0 issues GetS and later GetX while T1 and T2 issue GetS; T1 and T2 ABORT. Right: T0 issues GetX up front; the GetS requests from T1 and T2 STALL.]
    Results from Conflict Resolutions




Trends:
1) Timestamp & Hybrid better than Base
   Timestamp & Hybrid Better than Base




Fewer total stalls & eliminates all RW Requestor older stalls
       EE Summary with BaseTMProf
• BaseTMProf helps HTM designer understand
  performance of conflict resolution schemes

• Lightweight, fast, dynamic profiling
   • Can be implemented in prototype HTM systems




          Write-set Prediction Results
• Focus on workloads that degrade from prediction




       Prediction increases Stall cycles
  ExtTMProf’s Transaction-level Profiling




• Predictions Help: Prediction helps short transactions
• Predictions Hurt: Prediction hurts large transactions – reduces
  concurrency
        EE Summary with ExtTMProf
• Helps HTM designers understand why write-set
  prediction degrades (or improves) performance

• Offline analysis (e.g., traces) unable to determine
  performance implications of dynamic conflicts




• How can TMProf help analyze LL systems?


     LL: Parallel Versus Serial Commit
• Serial = Only one committer at a time

• Parallel = Multiple concurrent committers
  • Faster than Serial
  • We idealize its implementation




          LL: More Prefetching than EE
• Eager conflict detection:
   • Progress bounded by location of conflicts
   • Early conflicts → abort transactions early (little prefetching)
   • Late conflicts → abort transactions late (lots of prefetching)


• Lazy conflict detection:
   • Committers finish transaction before detecting conflicts
   • High probability for lots of prefetching




         Parallel Commit Results




Parallel commit removes commit token bottleneck
      Conflicts with Parallel Commit




All conflicts either RW or WR – no WW conflicts

      LL Summary with BaseTMProf
• BaseTMProf clearly shows why parallel commit
  helps
• Stall breakdown shows mostly WR conflicts

• BaseTMProf helps HTM designers decide
  whether to implement parallel commit
  • Parallel commit more complex than serial commit




          Prefetching Results




Useful Trans should be similar for EE & LL, but
LL incurs fewer cycles
Why?
ExtTMProf’s Transaction-level Profiling




 LL’s aborted transactions prefetch farther than EE
       LL Summary with ExtTMProf
• Explains why workloads execute faster on LL
  than on EE

• May influence HTM design decision to implement
  LL rather than EE

• Helps TM programmer understand why programs
  run faster on some HTMs



Software Rollback Better than Hardware
               Rollback




  Software rollback reduces Stalls & Wasted Trans
  May reduce contention in HTM?
    Hardware for Critical-path Profiling

• Counter-based profiling is not sufficient
• Multi-threaded programs exhibit variability:
   • Different dynamic code paths
   • Inter-thread dependencies
   • Memory latencies
• Factors change critical-path – longest control flow
  that determines execution time

• Hardware critical-path profiling can aid in
  understanding performance
   • Faster than offline, software analyses
                  Conclusions
• TMProf – lightweight per-processor hardware
  counters for understanding HTM performance
  • Cumulative event frequencies & overheads


• Two implementations: Base & Extended

• Two case studies: LogTM-SE & Approximation of
  TCC

• Future TMProf might add hardware support for
  critical-path profiling

                   Outline

• Introduction and Background

• Notary

• TMProf

• Conclusion




                 Conclusions
• Challenge #1: Hardware TM (HTM) systems may
  restrict transactions or incur overheads on
  common events.
• Contribution: LogTM-SE HTM

• Challenge #2: (1) H3 signatures have high area &
  power overheads & (2) Thread-private references
  cause false conflicts.
• Contribution: Notary


              Conclusions Cont.
• Challenge #3: Difficult to understand HTM
  system performance.
• Contribution: TMProf

• Challenge #4: Signatures suffer from false
  conflicts.
• Contribution: Six hardware/software extensions to
  signatures



      Other Research & Contributions
• OS Support for Virtualizing Transactional Memory
  [Swift et al., TRANSACT '08]
• Implementing Signatures for Transactional Memory
  [Sanchez et al., MICRO '07]
• Performance Pathologies in Hardware Transactional
  Memory [Bobba et al., ISCA '07 & Top Picks '08]
• Supporting Nested Transactional Memory in LogTM
  [Moravan et al., ASPLOS '06]

• GEMS 2.X development & support
  • SMT in Opal, LogTM-SE in Ruby
Thank You!


Questions?




Backup Slides




  LogTM-SE Processor Hardware
• Segmented log, like LogTM
• Track R / W sets with R / W signatures
    • Over-approximate R / W sets
    • Tracks physical addresses
    • Summary signature used for virtualization
• Conflict detection by coherence protocol
    • Check signatures on every memory access for SMT

[Figure: each SMT thread context holds Registers, Register Checkpoint, LogFrame, LogPtr, TMcount, and the Read, Write, SummaryRead, and SummaryWrite signatures; the data caches (Tag, Data) hold NO TM STATE]
            Thread Switching Support
• Why?
  • Support long-running transactions


• What?
  • Conflict Detection for descheduled transactions


• How?
  • Summary Read / Write signatures:
    If thread t of process P is scheduled to use an active signature,
    the corresponding summary signature holds the union of the saved
    signatures from all descheduled threads from process P.


    Updated using TLB-shootdown-like mechanism
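A minimal sketch of the summary-signature rule, treating each signature as a bit vector stored in a Python int. The function names are illustrative, and the membership test assumes the usual Bloom-filter rule that every hashed bit must be set.

```python
from functools import reduce


def summary_signature(saved_signatures) -> int:
    """Union (bitwise OR) of the saved signatures of all descheduled
    threads of one process."""
    return reduce(lambda acc, sig: acc | sig, saved_signatures, 0)


def conflicts_with_summary(summary: int, addr_bits: int) -> bool:
    """addr_bits holds the bit(s) an address hashes to; a possible
    conflict is reported only if every one of them is set."""
    return (summary & addr_bits) == addr_bits


# Saved write signatures of two descheduled threads (made-up values).
summary_w = summary_signature([0b01001000, 0b00000010])
```

Because the summary is a union, it can only add false positives; a real conflict with a descheduled transaction is never missed.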
             Handling Thread Switching
                                                WW 00000000
                                                  W       00000000
                                                         00000000
                                        SummaryR RR 00000000
                                             Summary
                                            Summary
                            OS                        00000000
                                                     00000000




            T1                   T2                          T3


                                                                             W    00000000
                                                                     Summary
                                                                             R    00000000


W       01001000            0100000                    00000000      W       0100000
                    W                          W
    R    01010010            01000010                   00000000         R     01010010
                        R                          R


         P1                  P2                         P3                   P4



                                                                                             76
               Handling Thread Switching
                                                        W        00000000
                                                                 01001000
                                                  SummaryR
                                  OS                            00000000
                                                                01010010

Deschedule




               T1                      T2                            T3


           W   00000000           W    00000000                W    00000000           W    00000000
     Summary              Summary                      Summary                 Summary
           R   00000000           R    00000000                R    00000000           R    00000000


    W     01001000                0100000                      00000000        W       0100000
            01001000      W                            W
        R   01010010                01000010                     00000000          R     01010010
              01010010        R                            R


           P1                       P2                           P3                    P4



                                                                                                       77
                  Handling Thread Switching
                                                             W        01001000
                                                                       W
                                                                      W    01001000
                                                                           01001000
                                                     SummarySummaryR 01010010
                                                            Summary
                                     OS                            R 01010010
                                                            R 01010010

Deschedule




                  T1                      T2                               T3


             W    00000000           W    00000000                  W     00000000            W    00000000
     Summary                 Summary                        Summary                   Summary
             R    00000000           R    00000000                  R     00000000            R    00000000


    W       01001000                 0100000                        00000000          W       0100000
                             W                              W
        R      01010010                01000010                       00000000            R     01010010
                                 R                              R


             P1                        P2                             P3                      P4



                                                                                                              78
             Handling Thread Switching
[Diagram: T1 is now descheduled and held by the OS, and P1's signatures are
cleared to 00000000. The summary signature (W 01001000 / R 01010010) is
installed on P2 and P3, where T2 and T3 run. The summary signatures on P1
and P4 remain 00000000.]



                                                                                                      79
 Thread Switching Support Summary

• Summary Read / Write signatures
  • Summarizes descheduled threads with
    active transactions


• One OS structure per process
                                          Coherence
• Check summary signature on every
  memory access

• Updated on transaction deschedule
  • Similar to TLB shootdown
                                                 80
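
The summary-signature mechanism above can be sketched in a few lines. This is a minimal illustration, not the LogTM-SE hardware: the 8-bit signature width mirrors the diagrams, while the modulo hash, the `ProcessSummary` class, and the conflict rule in `access_conflicts` are assumptions made for the example.

```python
SIG_BITS = 8  # toy width, matching the 8-bit signatures drawn on the slides


def sig_bit(block_addr):
    """Hypothetical hash: the low-order address bits pick one signature bit."""
    return 1 << (block_addr % SIG_BITS)


def may_contain(sig, block_addr):
    """Signatures answer 'maybe present' (true) or 'definitely absent' (false)."""
    return bool(sig & sig_bit(block_addr))


class ProcessSummary:
    """One OS structure per process: signatures of descheduled transactions."""

    def __init__(self):
        self.saved = {}  # thread id -> (read_sig, write_sig)

    def deschedule(self, tid, read_sig, write_sig):
        """Save a thread's signatures and return the new summary to distribute."""
        self.saved[tid] = (read_sig, write_sig)
        return self.summary()

    def reschedule(self, tid):
        """Hand the signatures back to the thread and recompute the summary."""
        sigs = self.saved.pop(tid)
        return sigs, self.summary()

    def summary(self):
        """The summary is the bitwise OR (set union) of all saved signatures."""
        read_sum = write_sum = 0
        for read_sig, write_sig in self.saved.values():
            read_sum |= read_sig
            write_sum |= write_sig
        return read_sum, write_sum


def access_conflicts(summary, block_addr, is_write):
    """Checked on every memory access by running threads of the process."""
    read_sum, write_sum = summary
    if may_contain(write_sum, block_addr):
        return True  # any access to a (possibly) written block conflicts
    return is_write and may_contain(read_sum, block_addr)


os_state = ProcessSummary()
summary = os_state.deschedule("T1", read_sig=0b01010010, write_sig=0b01001000)
print(access_conflicts(summary, block_addr=1, is_write=False))  # False
print(access_conflicts(summary, block_addr=1, is_write=True))   # True
```

The values for T1 are the ones shown in the thread-switching diagrams. Distributing the new summary to the other processors, which the slide compares to a TLB shootdown, is not modeled here.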
            Paging Support Summary
Problem:
   • Changing page frames
   • Need to maintain isolation on transactional blocks

Solution:
   On Page-Out:
   • Save Virtual -> Physical mapping
   On Page-In:
   • If different page frame, update signatures with physical
      address of transactional blocks in new page frame.
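
A rough sketch of the page-in step, assuming a simple bit-vector signature indexed by physical block number. The `Signature` class, the 64-byte block and 4 KB page sizes, and the `remap_page` helper are illustrative assumptions, not the actual LogTM-SE implementation.

```python
BLOCK_BYTES = 64     # assumed cache-block size
PAGE_BYTES = 4096    # assumed page size


class Signature:
    """Toy signature: one bit per (physical block number mod bits)."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.field = 0

    def _index(self, phys_addr):
        return (phys_addr // BLOCK_BYTES) % self.bits

    def insert(self, phys_addr):
        self.field |= 1 << self._index(phys_addr)

    def test(self, phys_addr):
        return bool((self.field >> self._index(phys_addr)) & 1)


def remap_page(sig, old_frame, new_frame):
    """On page-in to a different frame, re-insert every block that hit in the
    old frame at its new physical address. Returns the number of blocks added."""
    if old_frame == new_frame:
        return 0  # same frame: the signature is already correct
    moved = 0
    for offset in range(0, PAGE_BYTES, BLOCK_BYTES):
        if sig.test(old_frame * PAGE_BYTES + offset):
            sig.insert(new_frame * PAGE_BYTES + offset)
            moved += 1
    return moved


write_sig = Signature()
write_sig.insert(1 * PAGE_BYTES + 128)     # transactional block in frame 1
remap_page(write_sig, old_frame=1, new_frame=2)
print(write_sig.test(2 * PAGE_BYTES + 128))  # True: isolated in the new frame
```

Because bits are never cleared from a signature in this sketch, the old-frame addresses stay covered as well, so the signature is conservative for both frames.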



                                                           81
                Paging Support Animation
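[Animation: virtual page VP1 is paged out of physical page PP1 (blocks A, B,
C, D) and paged back in to a different physical page PP2 (blocks A', B', C',
D'). Each PP1 block address is checked against the read and write signatures
(A? B? C? D?); for hits, the matching PP2 block address is inserted.]

Read & Write signatures isolate memory blocks from PP1 & PP2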


                                                                  82
     BaseTMProf for LogTM-SE (1 of 3)

• Differentiate between read dependent & write
  dependent aborts
   • Meta-data (e.g., 3 bits for conflict types + 1 bit
     indicating if responder older) on NACK messages
   • Per-processor tables to track conflicts with other procs
   • RW conflict only = read-dependent


• Stall cycles = cycle conflict detected – cycle
  request sent to memory subsystem

• Abort cycles = cycle abort completes – cycle
  abort initiates
                                                           83
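
One way to picture the per-processor conflict table and the read-dependent versus write-dependent split is sketched below. The one-bit-per-conflict-type encoding (RW, WR, WW) and the `ConflictTable` class are assumptions for illustration; the slide only says that NACKs carry about 3 bits of conflict type plus an "older" bit.

```python
# Assumed encoding of the 3 conflict-type bits carried on a NACK message.
RW, WR, WW = 0b001, 0b010, 0b100


class ConflictTable:
    """Per-processor table of conflicts seen with other processors."""

    def __init__(self):
        self.by_proc = {}  # responder id -> (conflict bits, responder_older)

    def record_nack(self, responder, conflict_bits, responder_older):
        """Merge the meta-data of one NACK into the responder's entry."""
        bits, older = self.by_proc.get(responder, (0, False))
        self.by_proc[responder] = (bits | conflict_bits, older or responder_older)

    def classify_abort(self):
        """RW conflicts only means read-dependent; any WR or WW means
        write-dependent."""
        seen = 0
        for bits, _older in self.by_proc.values():
            seen |= bits
        if seen == 0:
            raise ValueError("no conflicts recorded")
        return "read-dependent" if seen == RW else "write-dependent"


table = ConflictTable()
table.record_nack(responder=2, conflict_bits=RW, responder_older=True)
print(table.classify_abort())  # read-dependent
table.record_nack(responder=3, conflict_bits=WW, responder_older=False)
print(table.classify_abort())  # write-dependent
```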
      BaseTMProf for LogTM-SE (2 of 3)
• Wasted_trans cycles = cycle abort initiates – cycle
  transaction begins
  • Store transaction begin cycle in separate register


• Commit cycles = cycle commit completes – cycle
  commit initiates
  • No commit actions = no commit cycles
  • Track cycle of start of commit action in separate register




                                                             84
     BaseTMProf for LogTM-SE (3 of 3)
• Nontrans cycles = cycle of transaction begin –
  cycle after last transaction commit
   • Track cycle of last transaction commit in separate
     register


• Backoff cycles = cycle retry transaction – cycle
  abort completes

• Barrier cycles = cycle exit barrier – cycle enter
  barrier
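
The cycle formulas on the three BaseTMProf slides can be read as a small set of timestamp registers plus accumulators. The sketch below models that reading in software; the `CycleProfiler` class and its event names are hypothetical, and stall and barrier cycles (which follow the same end-minus-start pattern) are left out for brevity.

```python
from collections import Counter


class CycleProfiler:
    """Software model of the timestamp registers and per-category counters."""

    def __init__(self, start_cycle=0):
        self.totals = Counter()
        self.last_commit = start_cycle  # register: cycle of last commit
        self.begin = None               # register: cycle transaction began
        self.abort_start = None
        self.abort_done = None          # set while waiting to retry
        self.commit_start = None        # register: start of commit action

    def trans_begin(self, cycle):
        if self.abort_done is None:
            # First attempt: time since the last commit is non-transactional.
            self.totals["nontrans"] += cycle - self.last_commit
        else:
            # Retry: time since the abort completed is backoff.
            self.totals["backoff"] += cycle - self.abort_done
            self.abort_done = None
        self.begin = cycle

    def abort_initiates(self, cycle):
        self.totals["wasted_trans"] += cycle - self.begin
        self.abort_start = cycle

    def abort_completes(self, cycle):
        self.totals["abort"] += cycle - self.abort_start
        self.abort_done = cycle

    def commit_initiates(self, cycle):
        self.commit_start = cycle

    def commit_completes(self, cycle):
        self.totals["commit"] += cycle - self.commit_start
        self.last_commit = cycle


p = CycleProfiler()
p.trans_begin(100)        # nontrans     += 100
p.abort_initiates(150)    # wasted_trans += 50
p.abort_completes(160)    # abort        += 10
p.trans_begin(175)        # backoff      += 15
p.commit_initiates(230)
p.commit_completes(232)   # commit       += 2
print(dict(p.totals))
```

Cycles spent in the attempt that eventually commits (175 to 230 here) are not one of the categories defined on these slides, so the sketch does not accumulate them.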

                                                          85
           ExtTMProf for LogTM-SE
• Work remaining after write-set prediction:
  • Store transaction size (read+write-set sizes) at each
    prediction - lazily copy to software or use many
    registers
  • At commit, subtract each saved transaction size from the
    final transaction size
  • Differences processed by software to produce
    histograms


• Size of aborted transactions:
  • Store read- and write-set sizes of aborted transaction
    in separate registers
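
The "work remaining" measurement can be sketched as follows, assuming sizes are counted in cache blocks and that software drains the saved sizes at commit. The `WorkRemainingProfiler` class and the bucket width are illustrative choices, not part of TMProf as described.

```python
from collections import Counter


class WorkRemainingProfiler:
    """Records transaction size at each write-set prediction and, at commit,
    how much of the transaction was still to come."""

    def __init__(self):
        self.sizes_at_prediction = []  # stands in for "many registers"
        self.differences = []          # copied out to software

    def on_prediction(self, read_set_size, write_set_size):
        self.sizes_at_prediction.append(read_set_size + write_set_size)

    def on_commit(self, read_set_size, write_set_size):
        final_size = read_set_size + write_set_size
        self.differences.extend(final_size - s for s in self.sizes_at_prediction)
        self.sizes_at_prediction.clear()

    def histogram(self, bucket=4):
        """Software post-processing: bucket the differences."""
        return Counter((d // bucket) * bucket for d in self.differences)


prof = WorkRemainingProfiler()
prof.on_prediction(read_set_size=2, write_set_size=1)   # size 3
prof.on_prediction(read_set_size=5, write_set_size=3)   # size 8
prof.on_commit(read_set_size=7, write_set_size=5)       # final size 12
print(prof.differences)  # [9, 4]
```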
                                                             86
              BaseTMProf for TCC
• Stall cycles recorded at transaction commit
  • When write-sets are broadcast or a commit request is sent
    to the directory


• No breakdown of read-dependent & write-
  dependent abort cycles
  • Since aborts do not stall the winner (the aborter)


• Committing cycles = cycle commit phase
  completes – cycle commit phase begins
  • Between cycle all stores flushed from write buffer &
    broadcasting write-set
                                                           87
                 ExtTMProf for TCC
• Size of aborted transactions:
    • Track read- and write-set sizes of aborted transactions
    • Just like for LogTM-SE




                                                           88
      Extensions to Signatures Overview
• Six extensions to reduce false conflicts

•   Static Transaction Identifier (XID) Independence
•   Object Identifiers (IDs)
•   Spatial locality with static signatures
•   Spatial locality with dynamic signatures
    [Callout beside the spatial-locality items: Best performance]
•   Coarse-fine hashing
•   Dynamic re-hashing

• Evaluate using ideal hardware & software
                                                     89
       XID Independence, Object IDs
• XID Independence:
  • Programmer declares set of static XIDs that conflict
    with each other
  • Information passed to hardware for conflict detection
  • Signature check only for XIDs that possibly conflict


• Object IDs:
  • High-level objects accessed by each transaction
     • E.g., Trees, hash buckets, nodes
  • Programmer declares set of objects accessed in
    transaction
  • Designed to handle dynamic, fine-grain conflicts
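
A minimal sketch of XID independence, assuming the programmer's declaration is a list of groups of static XIDs that may conflict. The helper names and the conservative treatment of undeclared XIDs are assumptions for this example.

```python
def make_conflict_table(conflicting_groups):
    """Build a lookup from each static XID to the XIDs it may conflict with."""
    table = {}
    for group in conflicting_groups:
        for xid in group:
            table.setdefault(xid, set()).update(group)
    return table


def needs_signature_check(table, requester_xid, responder_xid):
    """Only consult the signatures when the two XIDs can possibly conflict.
    Undeclared XIDs are checked conservatively (an assumption of this sketch)."""
    if requester_xid not in table or responder_xid not in table:
        return True
    return responder_xid in table[requester_xid]


# XIDs 1 and 2 touch the same data; XID 3 only conflicts with itself.
table = make_conflict_table([{1, 2}, {3}])
print(needs_signature_check(table, 1, 2))  # True
print(needs_signature_check(table, 1, 3))  # False: signature check skipped
```

When the check is skipped, a signature hit between the two transactions can no longer be reported as a (false) conflict.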
                                                            90
       Optimizing for Spatial locality
• Spatial locality exists in many programs
  • High probability that future accesses touch addresses
    neighboring the current address
  • Spatially local addresses may form a set that sets only
    a single signature bit
• Static signatures:
  • Signature hashes operate on fixed, larger granularity
    (i.e., greater than cache-block)
  • Granularity may not be suitable for all workloads
• Dynamic signatures:
  • A set of signatures that hash on different granularities
    & set of hit counters
  • Dynamically select which signature is “best” to use
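
The effect of a static, larger hashing granularity can be illustrated with a toy signature. The `GranularSignature` class, the 64-byte block size, and the modulo index are assumptions; the dynamic scheme's selection policy is not specified on the slide and is not modeled.

```python
BLOCK_BYTES = 64  # assumed cache-block size


class GranularSignature:
    """Toy signature that hashes on regions of `blocks_per_region` blocks."""

    def __init__(self, blocks_per_region, bits=64):
        self.region_bytes = blocks_per_region * BLOCK_BYTES
        self.bits = bits
        self.field = 0

    def _index(self, addr):
        return (addr // self.region_bytes) % self.bits

    def insert(self, addr):
        self.field |= 1 << self._index(addr)

    def test(self, addr):
        return bool((self.field >> self._index(addr)) & 1)

    def bits_set(self):
        return bin(self.field).count("1")


fine = GranularSignature(blocks_per_region=1)    # cache-block granularity
coarse = GranularSignature(blocks_per_region=4)  # fixed, larger granularity

for addr in (0x1000, 0x1040, 0x1080, 0x10C0):    # four neighboring blocks
    fine.insert(addr)
    coarse.insert(addr)

print(fine.bits_set())    # 4
print(coarse.bits_set())  # 1: the spatially local set shares one bit
```

The trade-off noted on the slide shows up directly: the coarse signature also reports a hit for any untouched block in the same region, so a fixed granularity may not suit every workload.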
                                                            91
Coarse-fine hashing, Dynamic re-hashing
• Coarse-fine hashing:
  • Split addresses into two regions: Coarse & Fine
     • Coarse – High-order address bits (e.g., page number)
     • Fine – Low-order address bits (e.g., multiple cache blocks)
  • Assign signature hashes to operate on Coarse & Fine
    bits


• Dynamic re-hashing:
  • False conflicts can be caused by bad luck
  • Dynamically alter hash functions – rotate input address
    bits before hashing
     • Transform persistent false conflicts into transient false conflicts
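
Dynamic re-hashing by rotating the input address bits can be sketched as below. The 32-bit address width, the 6-bit bit-selection index, and the function names are assumptions chosen to keep the example small.

```python
ADDR_BITS = 32  # assumed address width


def rotate_left(value, amount, width=ADDR_BITS):
    """Rotate a `width`-bit value left by `amount` bits."""
    amount %= width
    mask = (1 << width) - 1
    return ((value << amount) | (value >> (width - amount))) & mask


def sig_index(block_addr, rotation, index_bits=6):
    """Rotate the address, then apply a fixed bit-selection hash."""
    return rotate_left(block_addr, rotation) & ((1 << index_bits) - 1)


a = 0x00000005
b = 0x01000005   # differs from a only in a high-order bit

# With no rotation the two addresses always alias: a persistent false conflict.
print(sig_index(a, 0), sig_index(b, 0))  # 5 5
# After re-hashing with a different rotation they map to different bits.
print(sig_index(a, 8), sig_index(b, 8))  # 0 1
```

A different pair of addresses may alias under the new rotation, which is why the slide describes the result as transient rather than eliminated false conflicts.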

                                                                       92
                    Privatization interface

Privatization function                       Usage

shared_malloc(size),                         Dynamic allocation of shared
private_malloc(size)                         and private memory objects

shared_free(ptr),                            Frees up memory allocated by
private_free(ptr)                            shared or private allocators

privatize_barrier(num_threads, ptr, size),   Program threads come to a
publicize_barrier(num_threads, ptr, size)    common point to privatize or
                                             publicize an object. Must be
                                             used outside of transactions




                                                                            93
             Dynamic privatization
• Dynamically switch from private to shared, and
  vice versa

• If transitioning from private -> shared, safe to
  mark page as shared (at cost of performance)
• If transitioning from shared -> private, default
  policy is to disallow if other shared objects exist
  on the same page
   • Otherwise, trap to user software and let programmer
     call shared_free(), followed by private_malloc() on
     object
                                                           94
Bit-field overlaps hurt PBX




                                       95
Removing stack refs doesn’t help




                                       96
Entropy of commercial workloads




                                           97
              Type of Hash Functions

• In real programs, addresses neither independent nor
  uniformly distributed (key assumptions to derive
  PFP(n))
• But can generate hash values that are almost
  uniformly distributed and uncorrelated with good
  (universal/almost universal) hash functions
• Hash functions considered:




  • Bit-selection (inexpensive, low quality)
  • H3 [Carter, CSS79] (moderate, higher quality)
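
The two hash families can be contrasted in a short sketch. Bit-selection simply takes low-order address bits; an H3 hash computes each output bit as the XOR (parity) of a randomly chosen subset of input bits. The function names, widths, and seed below are assumptions for illustration.

```python
import random


def bit_select(addr, out_bits):
    """Bit-selection: use the low-order `out_bits` bits of the address."""
    return addr & ((1 << out_bits) - 1)


def make_h3(addr_bits, out_bits, seed=0):
    """Build one H3 hash: each output bit is the parity of the address ANDed
    with a random row mask (a random binary matrix applied over GF(2))."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(addr_bits) for _ in range(out_bits)]

    def h3(addr):
        value = 0
        for i, row in enumerate(rows):
            parity = bin(addr & row).count("1") & 1
            value |= parity << i
        return value

    return h3


h3 = make_h3(addr_bits=32, out_bits=10)

# Bit-selection ignores every bit above the selected field, so strided
# addresses collide.
print(bit_select(0x10, 4), bit_select(0x20, 4))  # 0 0
# H3 is linear over XOR, which is why it maps onto trees of XOR gates.
x, y = 0x1234ABCD, 0x0F0F1111
print(h3(x ^ y) == h3(x) ^ h3(y))  # True
```

The cost difference on the slide follows from the structure: bit-selection needs only wires, while each H3 output bit needs an XOR tree over roughly half of the input bits.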
                                                                  98
                  Notary Related Work
• Hash functions for memory hierarchy designs
   •   Used to reduce cache, bank, or row-buffer contention
   •   XOR hashes [Gonzales '97, Seznec '93, Zhang '00]
   •   Polynomial hashes [Rau '91]
• Alternatives to XOR hashing [Kharbutli '04, '05]
   •   Prime modulo & odd-multiplier displacement hashing
   •   Reduce probability of bad hash values
   •   Can require modifying existing hardware (e.g., additional TLB bits or adders)
• Detailed analysis of XOR hashes [Vandierendonck '05]
   •   Linear-algebra based analysis
   •   Replacing & swapping columns can minimize the fan-in and maximum fan-out of
       XOR gates
• Previous uses of entropy
   •   Overheads of addressing memory in ISA [Hammerstrom '77]
   •   Base Register Cache to reduce size of transferred address [Park '90]
   •   Mechanisms which compact & expand address & data values [Citron '95]
   •   Low-power TLB design [Ballapuram '06]
                                                                                       99
             Notary Related Work Cont.
 • Software-only privatization
     • Four pointer types for STMs [Scott '07]
     • exclude & only keywords for transactional OpenMP
       [Milovanovic '07]
     • private & shared keywords in OpenTM [Baek '07]
     • protect() and unprotect() for transactional C#
       [Abadi '08]


 • Hardware support for privatization
     • Virtual Memory Filter [Matveev '07]
         • More general than Notary's privatization
         • Programmer declares memory regions to be transactional
                                                                    100
             TMProf Related Work
• Profiling transaction characteristics &
  implementation-specific features [Hammond '04]
  • Ex: Read- and write-set sizes, nesting depth, commit
    bandwidth
  • Disadvantage: Does not profile common, high-level
    HTM overheads
• Transactional Application Profiling Environment
  (TAPE) [Chafi '05]
  • Profiles TCC HTM & summarizes problem areas back
    to source code lines
  • Disadvantage: Tied to TCC-specific overheads

                                                       101
           TMProf Related Work Cont.
• Performance Pathologies [Bobba '07]
    • Identified several pathologies affecting performance of
      eager & lazy HTM systems
    • Disadvantage: Pathologies identified offline using
      detailed traces
• Additional profiling [Perfumo '08, Porter '08]
    • Metrics like read-to-write ratio, abort rate
    • Statically predicting TM performance using Syncchar
    • Can be added to TMProf implementations



                                                           102
    Results from Conflict Resolutions




Trends:
1) Timestamp & Hybrid better than Base
2) Hybrid sometimes better than Timestamp
                                            103
        Hybrid Better than Timestamp


                               Fewer Stalls &
                               Wasted Trans




Fewer Wasted Trans



                                           104
    Results from Conflict Resolutions




Trends:
3) Timestamp can be worse than Base
4) Hybrid can be worse than Base
                                        105
            Timestamp Worse than Base


                                More Wasted Trans




Fewer WW Req.
Older Stalls
(i.e., more younger
thread aborts)
                                             106
               Hybrid Worse than Base

                                  More RW Req.
                                  younger stalls




Leads to load
imbalance
(more Barrier cycles)

                                                   107
                 Stall Breakdown




Prediction serializes read requests from older transactions
                                                              108
                 Stall Breakdown




Prediction serializes write requests (perhaps unnecessarily)
                                                           109
                  Locks are Hard
 // WITH LOCKS
 void move(T s, T d, Obj key){
   LOCK(s);
   LOCK(d);
   tmp = s.remove(key);
   d.insert(key, tmp);
   UNLOCK(d);
   UNLOCK(s);
 }

Thread 0                Thread 1
move(a, b, key1);       move(b, a, key2);

       DEADLOCK!

Moreover:
• Coarse-grain locking limits concurrency
• Fine-grain locking difficult
                                                          110
Motivation




             111
            Background on Aborts
• Read-dependent & Write-dependent
  • Read-dependent – conflict is RW only
  • Write-dependent – conflicts include WR or WW


• HTM system may optimize for read-dependent
  aborts
  • E.g., Eager conflict detection can release read-isolation
    early on aborts (no nesting)
  • Does not stall requestor




                                                          112
                  Notary Future Work
 • Dynamic entropy calculation:
      • How to adapt PBX hashing to entropy changes over time?
 • Dynamic privatization characteristics:
      • How common is it for objects to change sharing status?




Related Work                                                 113
Sun’s Rock HTM [Dice et al., ASPLOS’09]
• Best-effort HTM – 1st general-purpose processor
  with HTM support
• Profiling targets why transactions fail
  • TMProf profiles higher-level categories, including
    successes
• Aborts update Checkpoint Status Register (CPS)
• Version R2 includes more detailed breakdowns of
  CPS than R1
  • Different reasons for failure given same CPS status in
    R1
• Profiling in common with ExtTMProf: read- and
  write-set sizes of aborted transactions
                                                         114

								