					 COE-502 paper presentation 2



transactional coherence and
        consistency



  presenter: Muhammad Mohsin Butt (g201103010)
                    outline
• introduction
• current hardware
• tcc in hardware
• tcc in software
• performance evaluation
• conclusion.
                        introduction

• Transactional Coherence and Consistency (TCC) provides a
  lock-free transactional model that simplifies parallel
  hardware and software.

• Transactions, defined by the programmer, are the basic unit
  of parallel work.

• Memory coherence, communication, and memory consistency are
  implicit in a transaction.
                 current Hardware

• Provides the illusion of a single shared memory to all processors.
• The problem is divided into parallel tasks that work on
  shared data in shared memory.
• Complex cache coherence protocols are required.
• Memory consistency models are also required to ensure the
  correctness of the program.
• Locks are used to prevent data races and serialize access to
  shared data.
• Overhead from too many locks can degrade performance.
                 tcc in Hardware
• Processors execute speculative transactions in a continuous
  cycle.
• A transaction is a sequence of instructions, marked by
  software, that is guaranteed to execute and complete
  atomically.
• Provides an "all transactions, all the time" model, which
  simplifies parallel hardware and software.
                 tcc in Hardware
• While a transaction executes, its writes are collected in a
  local buffer.
• After the transaction completes, the hardware arbitrates
  system-wide for permission to commit it.
• After acquiring permission, the node broadcasts all of the
  transaction's writes as a single packet.
• Transmission as a single packet reduces the number of
  inter-processor messages and arbitrations.
• Other processors snoop on these write packets to detect
  dependence violations.
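
In rough pseudocode, the per-processor cycle described above looks
like this (every function name here is a hypothetical stand-in; real
TCC implements this cycle in hardware, not as a C API):

   #include <stdbool.h>

   void begin_speculative_transaction(void);     /* start buffering writes locally   */
   void run_transaction_body(void);              /* the software-defined transaction */
   bool violated_by_snooped_commit(void);        /* did a commit hit our read set?   */
   void discard_speculative_writes(void);
   void arbitrate_for_commit(void);              /* system-wide commit permission    */
   void broadcast_writes_as_single_packet(void); /* one packet per transaction       */

   void tcc_processor_loop(void)
   {
       for (;;) {
           begin_speculative_transaction();
           run_transaction_body();
           if (violated_by_snooped_commit()) {
               discard_speculative_writes();  /* lose only this transaction's work */
               continue;                      /* re-execute the transaction        */
           }
           arbitrate_for_commit();
           broadcast_writes_as_single_packet();
       }
   }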
                     tcc in Hardware
• TCC simplifies cache design
   • Processors hold data in unmodified or speculatively modified form.
   • During snooping, a line is invalidated if the commit packet contains
     the address only.
   • The line is updated if the commit packet contains both address and data.

• Protection against data dependence violations.
   • If a processor has read from any address in the commit packet, its
     transaction is violated and re-executed.
                   tcc in Hardware
• Current CMPs need features that provide speculative
  buffering of memory references and commit arbitration
  control.
• A mechanism is required for gathering all modified cache lines from
  each transaction into a single commit packet, either:
   • a write buffer completely separate from the caches, or
   • an address buffer containing a list of tags for the lines holding
     data to be committed.
                    tcc in Hardware
• Read bits
   • Set on a speculative read during a transaction.
   • The current transaction is violated and restarted if the snoop protocol
     sees a commit packet containing the address of a location whose read
     bit is set.

• Modified bits
   • Stores during a transaction set this bit to 1.
   • On a violation, lines with the modified bit set to 1 are invalidated.
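
A sketch of how these bits could drive the snoop logic (the structure and
helper functions below are hypothetical illustrations, not the actual TCC
hardware interface):

   #include <stddef.h>

   struct cache_line {
       int read_bit;      /* set on a speculative read in this transaction  */
       int modified_bit;  /* set on a speculative store in this transaction */
       /* ... tag, state, data ... */
   };

   struct cache_line *lookup_line(unsigned long addr);           /* hypothetical */
   void violate_and_restart(void);  /* invalidate lines with modified_bit set,
                                       then re-execute the transaction       */
   void update_line(struct cache_line *line, const void *data);
   void invalidate_line(struct cache_line *line);

   void snoop_commit_packet(const unsigned long *addrs,
                            const void *const *data, int n)
   {
       for (int i = 0; i < n; i++) {
           struct cache_line *line = lookup_line(addrs[i]);
           if (line == NULL)
               continue;                     /* address not cached here          */
           if (line->read_bit)
               violate_and_restart();        /* committed write hit our read set */
           else if (data[i] != NULL)
               update_line(line, data[i]);   /* packet carries address and data  */
           else
               invalidate_line(line);        /* packet carries address only      */
       }
   }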
                    tcc in software
• Programming with TCC is a three-step process:
• Divide the program into transactions.
• Specify transaction order.
   • Can be relaxed where ordering is not required.

• Tune performance.
   • TCC provides feedback on where in the program violations occur
     frequently.
       loop Based parallelization
• Consider calculating a histogram of 1000 integer percentages
  (buckets 0 to 100):
   /* input */
   int *data = load_data();
   int i, buckets[101];
   for (i = 0; i < 1000; i++) {
       buckets[data[i]]++;
   }
   /* output */
   print_buckets(buckets);
        loop Based parallelization
• Can be parallelized using:
t_for (i = 0; i < 1000; i++)
• Each loop body becomes a separate transaction (see the sketch
  after this list).
• When two parallel iterations try to update the same histogram
  bucket, the TCC hardware causes the later transaction to
  violate, forcing it to re-execute.
• A conventional shared-memory model would require locks to
  protect the histogram bins.
• Can be further optimized using
   • t_for_unordered()
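
Putting this together, the parallelized histogram looks roughly as
follows (a sketch that simply replaces the sequential for loop with
the t_for construct above):

   /* Parallel histogram with TCC's t_for: each iteration body is a
      transaction. If two iterations hit the same bucket, the hardware
      violates and re-executes the later one; no locks are needed.    */
   int *data = load_data();
   int i, buckets[101];
   t_for (i = 0; i < 1000; i++) {
       buckets[data[i]]++;
   }
   print_buckets(buckets);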
       fork Based parallelization
• t_fork() forces the parent transaction to commit and
  creates two completely new transactions:
  • One continues execution of the remaining code.
  • The second starts executing the function passed as a parameter, e.g.
/* Initial setup */
int PC = INITIAL_PC;
int opcode = i_fetch(PC);
while (opcode != END_CODE) {
    t_fork(execute, &opcode, 1, 1, 1);
    increment_PC(opcode, &PC);
    opcode = i_fetch(PC);
}
  explicit transaction commit ordering
• Provides partial ordering.
   • Done by assigning two parameters to each transaction:
   • a sequence number and a phase number.

• Transactions with the same sequence number commit in an
  order defined by the programmer.
• Transactions with different sequence numbers are
  independent.
• The order among transactions with the same sequence number
  is set by their phase numbers.
• The transaction with the lowest phase number commits first, as
  sketched below.
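
A sketch of how the ordering parameters might be used (stage_a and
stage_b are hypothetical child functions, and reading the three numeric
t_fork() arguments as child sequence number, parent phase increment,
and child phase increment is an assumption; the slides do not spell out
the exact signature):

   void stage_a(void *arg);   /* hypothetical pipeline stages */
   void stage_b(void *arg);

   void run_pipeline(void *work)
   {
       /* Same sequence number (2): the children's commits are ordered
          by their phase numbers, lowest phase committing first.       */
       t_fork(stage_a, work, 2, 1, 0);
       t_fork(stage_b, work, 2, 1, 1);

       /* A child forked with a different sequence number would commit
          independently of the transactions in sequence 2.             */
   }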
             performance evaluation
• Maximize Parallelization
   • Create as many transactions as possible.

• Minimize Violations
   • Keep transactions small to reduce the amount of work lost on a violation.

• Minimize Transaction Overhead
   • Do not make transactions too small.

• Avoid Buffer Overflow
   • Overflow can result in excessive serialization.
               performance evaluation
• Base Case
   • Simple parallelization without any optimization.
• Unordered
   • Finding loops that can be left unordered.
• Reduction
   • Finding areas that can exploit reduction operations.
• Privatization
   • Privatizing, per transaction, the variables that cause violations.
• Using t_commit()
   • Breaking large transactions into smaller ones that still execute on the
     same processor. This reduces the work lost on violations and prevents
     buffer overflow (see the sketch after this list).
• Loop Adjustments
   • Using various loop-adjustment optimizations provided by the compiler.
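
A sketch of the t_commit() optimization (num_blocks, BLOCK_SIZE,
process_element, and the commit interval are illustrative names and
values, not taken from the slides):

   /* Splitting a long transaction with t_commit(): a violation or a full
      write buffer now costs only the work done since the last commit,
      and the pieces still run on the same processor.                    */
   int b, k;
   t_for (b = 0; b < num_blocks; b++) {
       for (k = 0; k < BLOCK_SIZE; k++) {
           process_element(b, k);   /* hypothetical per-element work */
           if (k % 64 == 63)
               t_commit();          /* end this transaction, start a new one */
       }
   }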
performance evaluation
• Inner loops had too many violations; using the outer loop_adjust
  improved the results.
• Privatization and t_commit improve performance.
             performance evaluation




• CMP performance is close to the ideal TCC for a small number of processors.
                      conclusions
• Bandwidth limitations are still a problem for scaling TCC to
  more processors.
• No support for nested for loops.
• Dynamic optimization techniques are still required to
  automate performance tuning on TCC.

				