Cache Conscious Allocation of Pointer Based Data Structures by nikeborome


  Cache Conscious Allocation of Pointer Based Data Structures,
   Revisited with HW/SW Prefetching

  by: Josefin Hallberg, Tuva Palm and
             Mats Brorsson

   Presented by: Lena Salman
     Software approach
Two techniques:
 Cache-conscious allocation:
    By far the most efficient
 Software prefetch:
    Better suited for automation, better
    for implementation in compilers.

    Combining cache-conscious allocation with
    software prefetch does not add
    significantly to performance
    Hardware approach
 Calculating and prefetching pointers
 Calculating pointer dependencies
 Effects of effectively predicting what to evict
 from the cache
 General HW prefetch:
    More likely to pollute the cache
 Problem! All the hardware strategies take
 advantage of the increased locality of
 cache-consciously allocated data.
Prefetching and cache-conscious
  allocation
Should complement each other's
weaknesses:
  Reduce the prefetch overhead of fetching
   blocks with partially unwanted data.
  Prefetching should reduce the cache
   misses and miss latencies between the
Excellent improvement in execution time
Can be adapted to specific needs by choosing
the cache-conscious block size
(cc-block size)
Attempts to co-allocate data in the same
cache line, so that nodes referenced after each
other lie on the same cache line.
Allocation to improve
   Cache – conscious
Attempts to allocate the data in the same
cache line
Better locality can be achieved
Improved cache performance by a
reduction of misses

ccmalloc() performs the cache-conscious allocation of memory.
Takes an extra argument: a pointer to a data structure
that is likely to be referenced.

         #ifdef CCMALLOC
         child = ccmalloc(sizeof(struct node), parent);
         #else
         child = malloc(sizeof(struct node));
         #endif
Takes a pointer to data that is likely to
be referenced close (in time) to the
newly allocated structure
Invokes calls to the standard malloc():
   When allocating new cc-block
   When data is larger than cc-block
Otherwise: allocate in the empty slot of
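
The allocation rules above can be sketched in C. This is only an illustration under stated assumptions, not the allocator from the paper: the name ccmalloc_sketch, the helper same_cc_block, the 16-byte slot alignment and the per-block "bytes used" counter stored at the start of each block are all hypothetical. The sketch also assumes that the hint pointer is either NULL or a pointer previously returned from inside a cc-block, and it does not support freeing individual slots.

```c
#include <stdint.h>
#include <stdlib.h>

#define CC_BLOCK_SIZE 256u  /* cc-block size from the study; must be a power of two */
#define CC_ALIGN      16u   /* slot alignment inside a cc-block (assumption) */

/* Start address of the cc-block containing p (blocks are CC_BLOCK_SIZE-aligned). */
static uintptr_t cc_block_base(const void *p) {
    return (uintptr_t)p & ~(uintptr_t)(CC_BLOCK_SIZE - 1);
}

/* Hypothetical helper: do two pointers lie in the same cc-block? */
int same_cc_block(const void *a, const void *b) {
    return cc_block_base(a) == cc_block_base(b);
}

/* near: data likely to be referenced close in time to the new object.
 * It must be NULL or a pointer into a cc-block created by this function. */
void *ccmalloc_sketch(size_t size, const void *near) {
    /* Data larger than a cc-block: fall back to the standard malloc(). */
    if (size > CC_BLOCK_SIZE - CC_ALIGN)
        return malloc(size);

    /* Round the request up to a multiple of CC_ALIGN (at least one slot). */
    size_t need = (size + CC_ALIGN - 1) & ~(size_t)(CC_ALIGN - 1);
    if (need == 0)
        need = CC_ALIGN;

    if (near != NULL) {
        /* The first word of each cc-block counts the bytes already used. */
        size_t *used = (size_t *)cc_block_base(near);
        if (*used + need <= CC_BLOCK_SIZE) {
            void *slot = (char *)used + *used;  /* empty slot in the same block */
            *used += need;
            return slot;
        }
    }

    /* No hint, or the hinted block is full: allocate a new cc-block. */
    size_t *used = aligned_alloc(CC_BLOCK_SIZE, CC_BLOCK_SIZE);
    if (used == NULL)
        return NULL;
    *used = CC_ALIGN + need;
    return (char *)used + CC_ALIGN;
}
```

With 32-byte nodes and these assumed sizes, a parent and six children share one 256-byte cc-block; the next child opens a new block.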
cc-blocks = cache-conscious blocks
Demands cache lines large enough to contain
more than one pointer structure
The bigger the blocks, the lower the miss
rate, if allocation is smart.
Can be set dynamically in software,
independently of the HW cache line size.
In our study: cc-block size: 256 B
hardware cache line size: 16 B to 256 B
Prefetching will reduce the cost of
a cache miss
Can be controlled by software and/or
Software results in extra instructions
Hardware leads to complexity in
 Software controlled
Implemented by including a prefetch
instruction in the instruction set
Should be inserted well ahead of
reference, according to prefetch
In this study we use the greedy
algorithm by Mowry et al.
    Software prefetch –
     Greedy algorithm
  When a node is referenced, it
  prefetches all children at that node.
  Without extra calculation, this can only be
  done for children, not for grandchildren
  Easier to control and optimize
  Lower risk of polluting the cache
(since only needed lines are prefetched)
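
Greedy prefetching as described above can be sketched on a binary-tree sum in the spirit of treeadd. This is an assumption-laden illustration rather than the code used in the study: the node layout, the PREFETCH macro and the function name treeadd_greedy are hypothetical, and __builtin_prefetch is the GCC/Clang builtin (on other compilers the macro expands to a no-op).

```c
#include <stddef.h>

/* Hypothetical node layout for a binary tree. */
typedef struct node {
    int value;
    struct node *left;
    struct node *right;
} node;

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))  /* hint only, never faults */
#else
#define PREFETCH(p) ((void)(p))              /* no prefetch support: no-op */
#endif

/* Recursive sum of all values in the tree. */
long treeadd_greedy(const node *n) {
    if (n == NULL)
        return 0;
    /* Greedy step: when a node is referenced, prefetch all its children.
     * Grandchildren are not reachable without first loading the children. */
    PREFETCH(n->left);
    PREFETCH(n->right);
    return n->value + treeadd_greedy(n->left) + treeadd_greedy(n->right);
}
```

The prefetches cost two extra instructions per visited node, which is the software overhead mentioned earlier.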
Software greedy
 Hardware Controlled
Depending on the algorithm used,
prefetching can occur when a miss is
Or when a hint is given by the
programmer through an instruction,
Or can always occur on certain types of
    Hardware prefetch
Techniques used:
 Prefetch on miss
 Tagged Prefetch

 Attempt to utilize spatial locality
 Do NOT analyze data access patterns
Prefetches the next sequential line i+1
when detecting a miss on line i.

        Line i-1

       Line i : Miss!

       Line i+1 : will be prefetched
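
A toy model may help to see the behaviour of prefetch on miss. The sketch below assumes a tiny cache that never evicts and is indexed directly by line number; the type pom_cache and the functions pom_init and pom_access are hypothetical names for illustration, not part of the simulator used in the study.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POM_LINES 64  /* size of the toy address space, in cache lines */

typedef struct {
    bool present[POM_LINES];  /* is line i currently in the cache? */
    unsigned misses;
    unsigned prefetches;
} pom_cache;

void pom_init(pom_cache *c) {
    memset(c, 0, sizeof *c);
}

/* Reference one cache line; on a miss, also fetch line + 1. */
void pom_access(pom_cache *c, size_t line) {
    assert(line < POM_LINES);
    if (c->present[line])
        return;                      /* hit: nothing is prefetched */
    c->misses++;
    c->present[line] = true;         /* demand fetch of line i */
    if (line + 1 < POM_LINES && !c->present[line + 1]) {
        c->present[line + 1] = true; /* prefetch of line i+1 */
        c->prefetches++;
    }
}
```

In this model a purely sequential walk over lines 0 to 7 misses on every second line, so prefetch on miss halves the misses.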
      Tagged Prefetch
Each prefetched line is marked with a tag
When a prefetched line i is referenced,
line i+1 is prefetched
(even though no miss has occurred)

Efficient when memory accesses are fairly sequential,
and has been shown to be efficient
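
Tagged prefetch can be sketched with the same kind of toy model. The assumptions are a cache that never evicts, direct indexing by line number, and that a demand miss also triggers a prefetch of the next line (the usual formulation of tagged prefetch; the slides only state the tagged-hit case). The names tp_cache, tp_init and tp_access are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TP_LINES 64  /* size of the toy address space, in cache lines */

typedef struct {
    bool present[TP_LINES];  /* is line i in the cache? */
    bool tagged[TP_LINES];   /* was line i prefetched and not yet referenced? */
    unsigned misses;
    unsigned prefetches;
} tp_cache;

void tp_init(tp_cache *c) {
    memset(c, 0, sizeof *c);
}

/* Bring a line in as a prefetch and mark it with the tag. */
static void tp_prefetch(tp_cache *c, size_t line) {
    if (line < TP_LINES && !c->present[line]) {
        c->present[line] = true;
        c->tagged[line] = true;
        c->prefetches++;
    }
}

void tp_access(tp_cache *c, size_t line) {
    assert(line < TP_LINES);
    if (!c->present[line]) {
        c->misses++;                 /* demand miss on line i */
        c->present[line] = true;
        c->tagged[line] = false;
        tp_prefetch(c, line + 1);    /* assumption: a miss also prefetches i+1 */
    } else if (c->tagged[line]) {
        c->tagged[line] = false;     /* first reference to a prefetched line */
        tp_prefetch(c, line + 1);    /* prefetch i+1 although no miss occurred */
    }
}
```

In this model a sequential walk over lines 0 to 7 misses only on the first line, because every later line was prefetched when its predecessor was first referenced.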
Pre-fetch on miss – for

HW prefetch can be combined with
ccmalloc(), by introducing a hint with
address to the beginning of such a
Prefetch-one-cc on miss
Prefetch the next line after detecting a cache
miss on a cache-consciously allocated block.

Prefetch-all-cc on miss
 Decides dynamically how many lines to
 prefetch
 Depends on where in the cc-block the
 missing cache line is located.
 Prefetches all the cache lines in the
 cc-block, from the address causing the miss
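
The number of lines each cc-aware strategy would fetch can be sketched as simple address arithmetic. The sketch assumes that cc-blocks are aligned to their own size, that "all the cache lines" means the lines following the missing one up to the end of the cc-block, and that prefetch-one-cc does not cross the cc-block boundary; these are interpretations, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lines prefetched by prefetch-all-cc after a miss at miss_addr:
 * every line after the missing one, up to the end of its cc-block. */
size_t prefetch_all_cc_count(uintptr_t miss_addr, size_t line_size,
                             size_t cc_block_size) {
    assert(line_size > 0 && cc_block_size % line_size == 0);
    size_t lines_per_block = cc_block_size / line_size;
    size_t line_in_block = (size_t)(miss_addr % cc_block_size) / line_size;
    return lines_per_block - 1 - line_in_block;
}

/* Lines prefetched by prefetch-one-cc: the next line only, and none
 * when the miss is already on the last line of the cc-block (assumption). */
size_t prefetch_one_cc_count(uintptr_t miss_addr, size_t line_size,
                             size_t cc_block_size) {
    return prefetch_all_cc_count(miss_addr, line_size, cc_block_size) > 0 ? 1 : 0;
}
```

For a 256 B cc-block and 64 B lines, a miss on the first line would prefetch three lines under prefetch-all-cc and one under prefetch-one-cc; with 256 B lines there is nothing left to prefetch.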

Experimental Framework
MIPS-like, out-of-order processor simulator.
Memory latency equal to 50 ns random
access time.
    health – simulates the Colombian health care system
    mst – creates a graph and calculates its minimum spanning tree
    perimeter – calculates the perimeter of an image
    treeadd – calculates a recursive sum of values
More about benchmarks:
    health – elements are moved between lists during
     execution, and there is more calculation between
    mst – originally used a locality optimization
     procedure, which made the effect of ccmalloc() unnoticeable.
    perimeter – data is allocated in an order similar to
     the access order, resulting in a locality optimization.
    treeadd – has calculation between nodes in a
     balanced binary tree.
Results: Execution time
Memory stall: no instruction is retired in a cycle
because the oldest instruction waiting to be
retired is a load/store instruction.
FU stall: the oldest instruction is not a load/store
Fetch stall: there is no instruction waiting to
be retired.
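
The three stall categories can be expressed as a small classification function. This is a sketch of the definitions above, not code from the simulator; the enum names and classify_stall are hypothetical, and a NULL pointer is used to model the case where no instruction is waiting to be retired.

```c
#include <stddef.h>

typedef enum { OP_LOAD, OP_STORE, OP_OTHER } op_kind;
typedef enum { STALL_MEMORY, STALL_FU, STALL_FETCH } stall_kind;

/* Classify a cycle in which nothing retires.
 * oldest: kind of the oldest instruction waiting to be retired,
 *         or NULL when no instruction is waiting. */
stall_kind classify_stall(const op_kind *oldest) {
    if (oldest == NULL)
        return STALL_FETCH;   /* nothing to retire: fetch stall */
    if (*oldest == OP_LOAD || *oldest == OP_STORE)
        return STALL_MEMORY;  /* oldest instruction is a load/store */
    return STALL_FU;          /* oldest instruction is something else */
}
```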

Prefetching is likely to have an effect when memory stalls
are dominant!
Cache performance - SW
Miss rates are improved by most strategies
Increased spatial locality with ccmalloc()
reduces cache misses (less pollution)
Software prefetch shows some decrease in misses,
but prefetches a lot of unused data
Combination of software techniques achieves
the lowest rates
Cache performance –
    cache lines
The larger the cache lines, the more
effective ccmalloc() is
HW prefetch alone, however, tends to
pollute the cache with unwanted data
SW prefetch alone tends to fetch data
already present in the cache
 Cache performance:

SW prefetch achieves higher precision
HW prefetch strategies alone are no good.
HW prefetch is more sensitive to cache
line size than the SW prefetch
 Cache performance –
SW pref. with ccmalloc()

 Results in an increased share of used
 cache lines among the prefetched lines
    This is caused by increased spatial locality
 However, it also results in attempts to prefetch
 lines already in the cache.
Cache performance –
  HW prefetch with
HW prefetch strategies show greater improvement with
cache-conscious allocation than on their own
Prefetch-on-miss and tagged prefetch both
show the same results
Still: a large amount of unused prefetched lines
Unused lines decrease with larger cache
lines, due to spatial locality and less need
to prefetch
 The best way still remains cache
  conscious allocation – ccmalloc()
 Effective in overcoming the drawbacks of
  large cache lines
   Creates the locality necessary for prefetch
 The larger the cache line, the less
  prominent the prefetch strategy
        Conclusions 2:
 Combining cache-conscious allocation with HW
  prefetch brings no prominent gain, and it seems
  that ccmalloc() alone is enough
 However, ccmalloc() can be used to
  overcome the negative effects of
  next-line prefetch
 HW prefetch is better than SW prefetch
         Conclusions 3:
 When a compiler can use profiling info.
  and optimize memory allocation in
  cache-conscious manner – it’s
 However, when profiling is too
  expensive, a program will likely benefit from
  general prefetch support.
The end!

             You can tell
              me, I can
              take it..

             What's up
 Lena Salman
