
Insomniac’s SPU Best Practices
Hard-won lessons we’re applying to a 3rd generation PS3 title

Mike Acton
Eric Christensen
GDC 08
              Introduction
• What will be covered...
  – Understanding SPU programming
  – Designing systems for the SPUs
  – SPU optimization tips
  – Breaking the 256K barrier
  – Looking to the future
                 Introduction
• Isn't it harder to program for the SPUs?
  – No.
  – Classical optimization techniques still apply
  – Perhaps even more so than on other
    architectures.
     • e.g. In-order processing means predictable
       pipeline. Means easier to optimize.
  – Both at instruction-level and multi-processing
    level.
              Introduction
• Multi-processing is not new
  – Trouble with the SPUs usually is just trouble
    with multi-core.
  – You can't wish multi-core programming away.
    It's part of the job.
              Introduction
• But isn't programming for the SPUs
  different?
  – The SPU is not a magical beast only tamed by
    wizards.
  – It's just a CPU
  – Get your feet wet. Code something.
    • Highly Recommend Linux on the PS3!
               Introduction
• Seriously though. It's not the same, right?
  – Not the same if you've been sucked into one
    of the three big lies of software development...
              Introduction
• The “software as a platform" lie.
• The "domain-model design" lie.
• The "code design is more important than
  data design" lie
• ... The real difficulty is unlearning these
  mindsets.
                Introduction
• But what's changed?
  – Old model
    • Big semi truck. Stuff everything in. Then stuff some
      more. Then put some stuff up front. Then drive
      away.
  – New model
    • Fleet of Ford GTs taking off every five minutes.
      Each one only fits so much. Bucket brigade. Damn
      they're fast!
            Introduction
• But what about special code
  management?
  – Yes, you need to upload the code.
     • So what? Something needs to load
       the code on every CPU.
             Introduction
• But what about DMA'ing data?
  – Yes, you need to use a DMA controller
    to move around the data.
     • Not really different from calling
       memcpy
                 SPU DMA vs. PPU memcpy

SPU DMA: from main ram to local store

wrch    $ch16,   ls_addr
wrch    $ch18,   main_addr
wrch    $ch19,   size
wrch    $ch20,   dma_tag
il      $2,      MFC_GET_CMD
wrch    $ch21,   $2

PPU memcpy: from far ram to near ram

mr      $3,      near_addr
mr      $4,      far_addr
mr      $5,      size
bl      memcpy

SPU DMA: from local store to main ram

wrch    $ch16,   ls_addr
wrch    $ch18,   main_addr
wrch    $ch19,   size
wrch    $ch20,   dma_tag
il      $2,      MFC_PUT_CMD
wrch    $ch21,   $2

PPU memcpy: from near ram to far ram

mr      $3,      far_addr
mr      $4,      near_addr
mr      $5,      size
bl      memcpy

             Conclusion: If you can call memcpy, you can DMA data.
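The analogy can be made concrete in C. This is a minimal sketch: the function names (`dma_get`, `dma_put`) are hypothetical wrappers, and memcpy stands in for the MFC channel writes shown above, since the DMA engine only exists on hardware. The point is that the call shape is the same three essentials as memcpy (destination, source, size) plus a tag:

```c
#include <string.h>
#include <stdint.h>
#include <assert.h>

/* Hedged sketch: on real hardware this would issue an MFC "get"
 * (main ram -> local store) via the channel writes shown above.
 * Here memcpy stands in so the shape of the call is visible. */
static void dma_get(void *ls_addr, const void *main_addr,
                    uint32_t size, uint32_t dma_tag)
{
    (void)dma_tag;  /* tag selects which transfers to sync on later */
    memcpy(ls_addr, main_addr, size);
}

/* The "put" direction (local store -> main ram), same shape. */
static void dma_put(void *main_addr, const void *ls_addr,
                    uint32_t size, uint32_t dma_tag)
{
    (void)dma_tag;
    memcpy(main_addr, ls_addr, size);
}
```

The only real difference from memcpy is the tag, which is what makes the later "wait for completion" step selective.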
              Introduction
• But what about DMA'ing data?
  – But with more control about how and
    when it's sent, retrieved.
                    SPU Synchronization

Example Sync

DMA from main ram to local store.

Do other productive work while DMA is happening...

(Sync) Wait for DMA to complete:

il      $2,      1
shl     $2,      $2, dma_tag
wrch    $ch22,   $2
il      $3,      MFC_TAG_UPDATE_ALL
wrch    $ch23,   $3
rdch    $2,      $ch24

Fence: Transfer after previous with the same tag

PUTF    Transfer previous before this PUT
PUTLF   Transfer previous before this PUT LIST
GETF    Transfer previous before this GET
GETLF   Transfer previous before this GET LIST

Barrier: Transfer after previous and before next with the same tag

PUTB    Fixed order with respect to this PUT
PUTLB   Fixed order with respect to this PUT LIST
GETB    Fixed order with respect to this GET
GETLB   Fixed order with respect to this GET LIST

Lock Line Reservation

GETLLAR Gets locked line. (PPU: lwarx, ldarx)
PUTLLC  Puts locked line. (PPU: stwcx, stdcx)
             Introduction
• Bottom line: SPUs are like most CPUs
  – Basics are pretty much the same.
  – Good data design decisions and smart
    code choices see benefits on any platform.
  – A good DMA pattern also means good cache
    coherency. Better on every platform.
  – Bad choices may work on some platforms,
    but not others.
  – Xbox 360, PC, Wii, DS, PSP, whatever.
             Introduction
• And that's what we're talking about today.
  – Trying to apply smart choices to these
    particular CPUs for our games.
     • That's what console development is.
  – What mistakes we've made along the
    way.
  – What's worked best.
    Understanding the SPUs
• Rule 1: The SPU is not a co-processor!
  – Don't think of SPUs as hiding time
    “behind” a main PPU loop
     Understanding the SPUs
• What “clicked” with some Insomniacs
  about the SPUs:
   – “Everything is local”
   – “Think streams of data”
   – “Forget conventional OOP”
   – “Everything is a quadword”
   – “si intrinsics make things clearer”
   – “Local memory is really, really fast”
     Designing for the SPUs
• The ultimate goal: Get everything on the
  SPUs.
  – Leave the PPU for shuffling stuff around.
• Complex systems can go on the SPUs
  – Not just streaming systems
  – Used for any kind of task
  – But you do need to consider some
    things...
          Designing for the SPUs
• Data comes first.
  – Goal is minimum energy for
    transformation.
  – What is energy usage? CPU time.
    Memory read/write time. Stall time.


  Input          Transform()         Output
     Designing for the SPUs

• Design the transformation pipeline back to
  front.
   – Start with your destination data and
     work backward.
   – Changes are inevitable. This way you
     pay less for them.
   – An example...
                 Front to Back vs. Back to Front

Front to Back (started here):

  Simulate Glass → Generate Crack Geometry → igTriangulate → Render

  – Had a really nice looking simulation, but would find out soon
    that this stage was worthless.
  – Then wrote igTriangulate. Oops: the only possible output didn't
    support the “glamorous” crack rendering.
  – Realized that the level of detail from the simulation wasn't
    necessary considering the granularity restrictions (memory, cpu).
    Could not support it.
  – The rendering part of the pipeline didn't completely support the
    outputs of the triangulation library.
  – Even worse, the inputs that were being provided to the
    triangulation library weren't adequate. Needed more information
    about retaining surface features.

Back to Front:

  Render → igTriangulate → Simulate Glass

  – Rendered dynamic geometry using fake mesh data.
  – Faked inputs to igTriangulate and output transformed data to the
    render stage.
  – Wrote the simulation to provide useful (and expected) results to
    the triangulation library.

Lessons:

  • Could have avoided re-writing the simulation if the design
    process was done in the correct order.
  • Good looking results were arrived at with a much smaller
    processing and memory impact.
  • The full simulation turned out to be unnecessary since its
    outputs weren't realistic considering the restrictions of the
    final stage.
  • Proof that “code as you design” can be disastrous.
  • Working from back to front forces you to think about your
    pipeline in advance. It's easier to fix problems that live in
    front of final code. Wildly scattered fixes and data format
    changes will only end in sorrow.
      Designing for the SPUs

• The data the SPUs will transform is the
  canonical data.
• i.e. Store the data in the best format for the
  case that takes the most resources.
     Designing for the SPUs

• Minimize synchronization
  – Start with the smallest synchronization
    method possible.
     Designing for the SPUs

• Simplest method is usually lock-free single
  reader, single writer queue.
PPU Ordered Write:
  1. Write Data
  2. lwsync
  3. Increment Index

SPU Ordered Write:
  1. Write Data
  2. Increment Index (with Fence)
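The write-then-publish pattern can be sketched as a plain C single-reader/single-writer ring. This is a hedged, single-threaded illustration: the names are hypothetical, and the ordering barrier between writing the data and bumping the index is only marked by a comment (on the PPU it would be an lwsync, on the SPU a fenced DMA of the index):

```c
#include <stdint.h>
#include <assert.h>

#define QUEUE_SIZE 8u           /* power of two so masking wraps cleanly */

typedef struct {
    uint32_t entries[QUEUE_SIZE];
    volatile uint32_t write_index;  /* owned by the single writer */
    volatile uint32_t read_index;   /* owned by the single reader */
} spsc_queue;

/* Write the payload first, THEN publish by bumping the index.
 * Between the two steps the PPU needs an lwsync; the SPU would send
 * the index with a fenced DMA (e.g. PUTF) so it cannot pass the data. */
static int queue_push(spsc_queue *q, uint32_t value)
{
    if (q->write_index - q->read_index == QUEUE_SIZE)
        return 0;                           /* full */
    q->entries[q->write_index % QUEUE_SIZE] = value;
    /* lwsync / DMA fence goes here on real hardware */
    q->write_index++;
    return 1;
}

static int queue_pop(spsc_queue *q, uint32_t *out)
{
    if (q->read_index == q->write_index)
        return 0;                           /* empty */
    *out = q->entries[q->read_index % QUEUE_SIZE];
    q->read_index++;
    return 1;
}
```

Because exactly one side writes each index, no lock is ever taken; ordering alone makes it safe.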
      Designing for the SPUs
• Fairly straightforward to load balance
  – For constant time transforms, just divide
    into multiple queues
  – For other transforms, use heuristic to
    decide times and a single entry queue to
    distribute to multiple queues.
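The second case can be sketched in a few lines of C. This is a hedged illustration, not Insomniac's scheduler: job costs come from some heuristic estimate (names here are hypothetical), and each job simply goes to the currently least-loaded queue:

```c
#include <stdint.h>

/* Pick the queue with the least estimated work queued so far. */
static int pick_queue(const uint32_t queue_load[], int num_queues)
{
    int best = 0;
    for (int i = 1; i < num_queues; ++i)
        if (queue_load[i] < queue_load[best])
            best = i;
    return best;
}

/* Distribute jobs with heuristic costs across the queues.
 * queue_of_job[] records which queue each job landed in. */
static void distribute(const uint32_t job_cost[], int num_jobs,
                       uint32_t queue_load[], int queue_of_job[],
                       int num_queues)
{
    for (int j = 0; j < num_jobs; ++j) {
        int q = pick_queue(queue_load, num_queues);
        queue_of_job[j] = q;
        queue_load[q] += job_cost[j];
    }
}
```

For constant-time transforms this degenerates to simple round-robin, which is why splitting into multiple queues is enough in that case.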
      Designing for the SPUs

• Then work your way up.
  – Is there a pre-existing sync point that will
    work? (e.g. vsync)
  – Can you split your data into need-to-
    sync and don't-care?
          Resistance: Fall of Man vs. Resistance 2

[Side-by-side frame timelines comparing the two effect pipelines:
Resistance: Fall of Man with immediate effect updates only, and
Resistance 2 with immediate & deferred effect updates plus reduced
sync points. The two slides that follow break each timeline down.]

Color legend (original slides):
  = PPU time overlapping effects SPU time
  = PPU time spent on effect system
  = PPU time that cannot be overlapped
               Resistance: Fall of Man
           Immediate Effect Updates Only

PPU timeline:
  Update Game Objects
  Run Immediate Effect Updates            → SPU: Immediate Update
  Finish Frame Update & Start Rendering
  Sync Immediate Effect Updates
  Generate Push Buffer To Render Frame
  Generate Push Buffer To Render Effects
  Finish Push Buffer Setup

Annotations:
  – No effects can be updated till all game objects have updated,
    so attachments do not lag.
  – Visibility and LOD culling done on PPU before creating jobs.
  – Each effect is a separate SPU job.
  – Effect updates run on all available SPUs (four).
  – Likely to stall at the sync, due to the limited window in which
    to update all effects.
  – The number of effects that could render was limited by the
    available PPU time to generate their push buffers.
                         Resistance 2
     Immediate & Deferred Effect Updates + Reduced Sync Points

PPU timeline:
  Sync Immediate Updates For Last Frame
  Run Deferred Effect Update/Render       → SPU: Deferred Update & Render
  Update Game Objects
  Sync Deferred Updates
  Post Update Game Objects
  Run Effects System Manager              → SPU: System Manager
  Finish Frame Update & Start Rendering
  Sync Effect System Manager
  Run Immediate Effect Update/Render      → SPU: Immediate Update & Render
  Generate Push Buffer To Render Frame
  Finish Push Buffer Setup

Annotations:
  – Initial PB allocations done on PPU. Single SPU job for each SPU
    (anywhere from one to three).
  – Huge amount of previously unused SPU processing time available.
  – Deferred effects are one frame behind, so effects attached to
    moving objects usually should not be deferred.
  – SPU manager handles all visibility and LOD culling previously
    done on the PPU. Generates lists of instances for update jobs
    to process.
  – Immediate updates are allowed to run till the beginning of the
    next frame, as they do not need to sync to finish generating
    this frame's PB. (Can run past end of PPU frame due to reduced
    sync points.)
  – Doing the initial PB alloc on the PPU eliminates the need to
    sync SPU updates before generating the full PB.
  – Smaller window available to update immediate effects, so only
    effects attached to moving objects should be immediate.
     Designing for the SPUs

• Write “optimizable” code.
  – Often “optimized” code can wait a bit.
  – Simple, self-contained loops
     • Over as many iterations as possible
     • No branches
     Designing for the SPUs

• Transitioning from "legacy" systems...
  – We're not immune to design problems
  – Schedule, manpower, education, and
    experience all play a part.
      Designing for the SPUs

• Example from RCF...
  – FastPathFollowers C++ class
  – And its derived classes
  – Running on the PPU
  – Typical Update() method
     • Derived from a root class of all
       “updatable” types
     Designing for the SPUs

• Where did this go wrong?
• What rules were broken?
  – Used domain-model design
  – Code “design” over data design
  – No advantage of scale
  – No synchronization design
  – No cache consideration
     Designing for the SPUs

• Result:
  – Typical performance issues
  – Cache misses
  – Unnecessary transformations
  – Didn't scale well
  – Problems after a few hundred updating objects
     Designing for the SPUs

• Step 1: Group the data together
  – “Where there's one, there's more than
    one.”
  – Before the update() loop was called,
  – Intercepted all FastPathFollowers and
    derived classes and removed them from
    the update list.
  – Then kept in a separate array.
     Designing for the SPUs

• Step 1: Group the data together
  – Created a new function,
    UpdateFastPathFollowers()
  – Used the new list of same-typed data
  – Generic Update() no longer used
  – (Ignored derived class behaviors here.)
     Designing for the SPUs

• Step 2: Organize Inputs and Outputs
  – Define what's read, what's write.
  – Inputs: Position, Time, State, Results of
    queries, Paths
  – Outputs: Position, State, Queries,
    Animation
  – Read inputs. Transform to Outputs.
     • Nothing more complex than that.
     Designing for the SPUs

• Step 3: Reduce Synchronization Points
  – Collected all outputs together
  – Collected any external function calls
    together into a command buffer
     • Separate Query and Query-Result
     • Effectively a Queue between systems
  – Reduced from many sync points per
    “object” to one sync point for the system
     Designing for the SPUs

• Before Pattern:
  – Loop Objects
     • Read Input 0
     • Update 0
     • Write Output
     • Read Input 1
     • Update 1
     • Call External Function
     • Block (Sync)
     Designing for the SPUs

• After Pattern (Simplified)
  – Loop Objects
     • Read Input 0, 1
     • Update 0, 1
     • Write Output, Function to Queue
  – Block (Sync)
  – Empty (Execute) Queue
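The "write function to queue, execute after sync" step can be sketched as a deferred command buffer. This is a hedged sketch with hypothetical names (`command_buffer`, `defer_call`); the real system queues whatever call signature its external functions need:

```c
#include <stdint.h>

#define MAX_COMMANDS 64

typedef void (*deferred_fn)(uint32_t arg);

typedef struct { deferred_fn fn; uint32_t arg; } command;

typedef struct {
    command entries[MAX_COMMANDS];
    int     count;
} command_buffer;

/* During the update loop, external calls are recorded, not made.
 * The transform stays self-contained and re-entrant. */
static void defer_call(command_buffer *cb, deferred_fn fn, uint32_t arg)
{
    if (cb->count < MAX_COMMANDS) {
        cb->entries[cb->count].fn  = fn;
        cb->entries[cb->count].arg = arg;
        cb->count++;
    }
}

/* After the single sync point, drain the queue on the other side. */
static void execute_commands(command_buffer *cb)
{
    for (int i = 0; i < cb->count; ++i)
        cb->entries[i].fn(cb->entries[i].arg);
    cb->count = 0;
}

/* Illustrative deferred callee for the usage below. */
static uint32_t g_total;
static void add_to_total(uint32_t arg) { g_total += arg; }
```

All external side effects now happen at one known point, which is what collapses many per-object sync points into one per-system sync point.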
     Designing for the SPUs

• Next: Added derived-class functionality
• Similarly simplified derived-class Update()
  functions into functions with clear inputs
  and outputs.
• Added functions to deferred queue as any
  other function.
• Advantage: Can limit derived functionality
  based on count, LOD, etc.
     Designing for the SPUs

• Step 4: Move to PPU thread
  – Now system update has no external
    dependencies
  – Now system update has no conflicting
    data areas (with other systems)‫‏‬
  – Now system update does not call non-
    re-entrant functions
  – Simply put in another thread
     Designing for the SPUs

• Step 4: Move to PPU thread
  – Add literal sync between system update
    and queue execution
  – Sync can be removed because only
    single reader and single writer to data
     • Queue can be emptied while being
       filled without collision
     • See: R&D page on multi-threaded
       optimization
     Designing for the SPUs

• Step 5: Move to SPU
  – Now completely independent thread
  – Can be run anytime
  – Prototype for new SPU system
     • AsyncMobyUpdate
     • Using SPU Shaders
     Designing for the SPUs

• Transitioning from “SPU as coprocessor”
  model.
• Example: igPhysics from Resistance to
  now...
                  Resistance: Fall of Man
                 igPhysics Pipeline (before)

PPU timeline:
  Environment Pre-Update (Resolve Anim+IK)
  Environment Update
  Collision Update (start collision jobs while building)
  Sync Collision Jobs and Process Contact Points     *Blocked!
  Associate Rigid Bodies Through Constraints
  Package Rigid Body Pools (start SPU jobs while packing)
  Sync Sim Jobs and Process Rigid Body Data          *Blocked!
  Post Update (Transform Anim Joints)

SPU job: Collide Prims (generate contacts)
  AABB Tests
  Triangle Intersection
  Sphere, Capsule, etc.
  Pack contact points

SPU job: Simulate
  Unpack Constraints
  Generate Jacobian Data
  Solve Constraints
  Pack Rigid Body Data

Note: One job per object (box, ragdoll, etc.).
*The only time hidden between start and stop of jobs is the packing
 of job data. The only other savings come from merely running the
 jobs on the SPU.
                       Resistance 2
                  igPhysics Pipeline (now)

PPU timeline:
  Environment Update
  Triangle Cache Update
  Object Cache Update
  Start Physics Jobs
  PPU Work
  Sync Physics Jobs
  Update Rigid Bodies

SPU jobs (collision):
  Upload Object Cache: Upload Tri-Cache, Upload RB Prims
  Collide Triangles:   Upload Intersect Funcs, Intersection Tests
  Collide Primitives:  Upload CO Prims, Upload Intersect Funcs,
                       Intersection Tests (for each iteration)

SPU jobs (simulation):
  Upload Physics Joints:  Sort Joint Types; Per Joint Type, Upload
                          Jacobian Generation Code
  Build Simulation Pools: Calculate Jacobian Data
  Upload Solver Code:     Solve Constraints
  Simulate Pools:         Integrate
  Post Update:            For Each Physics Object, Upload Anim
                          Joints; Transform Anim Joints Using Rigid
                          Body Data; Send Update To PPU
        Optimizing for SPUs

• Instruction-level optimizations are similar
  to any other platform
   – i.e. Look at the instruction set and write
     code that takes advantage of it.
       Optimizing for SPUs
• Memory transfer optimizations are similar
  to any other platform
   – i.e. Organize data for line-length and
     coherency. Separate read and write
     buffers wherever possible.
   – DMA is exactly like cache pre-fetch
        Optimizing for SPUs

• Local memory optimizations are similar to
  any other platform
   – i.e. Have a fixed-size buffer, split it into
     smaller buffers for input, output,
     temporary data and code.
   – Organizing 256K is essentially the same
     process as organizing 256M
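A fixed local-store budget can be carved up entirely at compile time. This is a hedged sketch with illustrative sizes (not Insomniac's actual budgets); the point is that every byte of the 256K-style arena is accounted for statically, with no allocator:

```c
#include <stdint.h>

/* Hypothetical budget split for a 256K local store. */
#define LS_TOTAL     (256 * 1024)
#define CODE_SIZE    (64 * 1024)
#define IN_SIZE      (64 * 1024)
#define OUT_SIZE     (64 * 1024)
#define SCRATCH_SIZE (LS_TOTAL - CODE_SIZE - IN_SIZE - OUT_SIZE)

typedef struct {
    uint8_t code[CODE_SIZE];       /* uploaded program text          */
    uint8_t input[IN_SIZE];        /* DMA-in staging buffer          */
    uint8_t output[OUT_SIZE];      /* DMA-out staging buffer         */
    uint8_t scratch[SCRATCH_SIZE]; /* temporary working data         */
} local_store_layout;

/* The whole budget must add up exactly -- caught at compile time. */
_Static_assert(sizeof(local_store_layout) == LS_TOTAL,
               "local store layout over or under budget");
```

Scaling the same struct up by three orders of magnitude is how you would budget 256M, which is the point of the slide.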
       Optimizing for SPUs

• Memory layout
  – Memory is dedicated to your code.
  – Memory is local to your code.
  – Design so you know what will read and
    write to the memory
     • i.e. DMAs from PPU, other SPUs, etc.
  – Generally fairly straightforward.
  – Remember you can use an offline tool to
    lay out your memory if you want.
       Optimizing for SPUs
• Memory layout
  – But never, ever try to use a dynamic
    memory allocator.
     • Malloc for dedicated 256K would be
       ridiculous.
     • OK. Malloc in a console game would
       be ridiculous.
        Optimizing for SPUs
• Memory layout
  – Rules of thumb:
     • Organize everything into blocks of
       16b.
        –SPU Reads/Writes only 16b
     • Group same fields together
        – No “single object” data
        – Similar to most SIMD.
        – Similar to GPUs.
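"Group same fields together" is the structure-of-arrays layout. A minimal sketch (field names are illustrative): instead of interleaving position, velocity, and state per object, keep one contiguous array per field, so a transform that only touches positions streams pure position data in each 16-byte fetch:

```c
#define MAX_OBJECTS 1024

/* Structure-of-arrays: each field is its own contiguous stream. */
typedef struct {
    float pos_x[MAX_OBJECTS];
    float pos_y[MAX_OBJECTS];
    float pos_z[MAX_OBJECTS];
    float vel_x[MAX_OBJECTS];
    float vel_y[MAX_OBJECTS];
    float vel_z[MAX_OBJECTS];
} object_soa;

/* A transform touching only x streams nothing but x data --
 * no "single object" padding rides along in each quadword. */
static void integrate_x(object_soa *o, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        o->pos_x[i] += o->vel_x[i] * dt;
}
```

The same loop over an array-of-structs would drag five unused fields through memory for every position it updates.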
        Optimizing for SPUs
• Memory transfer
  – Usually pretty straightforward
  – Rules of thumb:
      • Keep everything 128b aligned
          – Nothing different. Same rule as the
            PPU. (Cache-line is 128b)
      • Transfer as much data as possible
        together. Transform together.
          – Nothing different. Same rule as the
            PPU. (For cache coherency)
        Optimizing for SPUs
• Memory transfer
  – Let's dig in to these “rules of thumb” a
    bit...
  – Shared alignment between main ram
    and SPU local memory is going to be
    faster. (So pick an alignment and stick
    with it.)
  – Transfer is done in 128b blocks, so
    alignment isn't strictly necessary (but no
    worries about the above if it is)
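In C, pinning a buffer to the 128-byte rule is one keyword plus one check. A hedged sketch (names are illustrative); the predicate is what you would assert on any address handed to the DMA engine:

```c
#include <stdint.h>
#include <stdalign.h>

/* Pin the buffer to a 128-byte (cache-line sized) boundary. */
static alignas(128) uint8_t transfer_buffer[1024];

/* True if the address sits on a 128-byte boundary. */
static int is_cacheline_aligned(const void *p)
{
    return ((uintptr_t)p & 127u) == 0;
}
```

Keeping both the main-ram and local-store sides on the same 128-byte phase is what makes the transfer hit full rate.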
       Optimizing for SPUs
• Number of transfers doesn't really matter
  (re: Biggest transfers possible) but...
   – You want to transfer 128b blocks, not
     scattered ones.
   – You want to minimize synchronization
     (sync on fewer dma tags)
   – You have fewer places to worry about
     alignment.
   – You want to minimize scatter/gather.
     Especially considering TLB misses.
        Optimizing for SPUs
• Memory transfer
  – Rules of thumb:
      • If scattered reads or writes are
        necessary, use a DMA list (not
        individual DMAs)
         –Advantage over PPU. PPU can't do
           out-of-order, grouped memory
           transfer.
         –Keeps predictability of in-order
           execution with performance of out-
           of-order memory transfer.
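The DMA-list idea can be sketched in portable C. This is a hedged illustration: a real MFC list element stores a transfer size and the low 32 bits of an effective address, while here a plain pointer and memcpy stand in for the hardware walk, and the names are hypothetical. The shape is the same: a list of (size, address) pairs gathered back-to-back into one local buffer:

```c
#include <string.h>
#include <stdint.h>

/* Illustrative stand-in for an MFC DMA-list element. */
typedef struct {
    uint32_t    size;
    const void *src;   /* real lists store a 32-bit effective address */
} dma_list_element;

/* Gather every listed region contiguously into ls_base, in order.
 * Returns total bytes gathered. On hardware the MFC walks the list
 * and may reorder the individual transfers. */
static uint32_t dma_getl(void *ls_base, const dma_list_element *list,
                         int count)
{
    uint8_t *dst = (uint8_t *)ls_base;
    for (int i = 0; i < count; ++i) {
        memcpy(dst, list[i].src, list[i].size);
        dst += list[i].size;
    }
    return (uint32_t)(dst - (uint8_t *)ls_base);
}
```

One list means one tag to sync on, however scattered the sources are, which is exactly the advantage the slide describes.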
        Optimizing for SPUs
• Speaking of out-of-order transfers...
  – Use DMA fence to dictate order
  – Reads and writes are interleaved,
     • If you need max transfer performance,
       issue them separately.
        Optimizing for SPUs
• Memory transfer
  – Double, Triple buffer optimization
  – (Show fence example)
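The double-buffer pattern can be sketched in C. This is a hedged stand-in: memcpy replaces the tagged DMA kicks, so the "wait on tag" step collapses to a comment, but the structure is the real one: fetch chunk i+1 into one buffer while transforming chunk i in the other (the transform here just adds 1 to each byte for illustration):

```c
#include <string.h>
#include <stdint.h>

#define CHUNK 64   /* illustrative chunk size */

/* Double-buffered streaming: kick the next fetch, then work on the
 * current buffer, alternating between two local buffers so transfer
 * and transform overlap. On hardware each memcpy would be a tagged,
 * fenced DMA get. */
static void process_stream(const uint8_t *src, uint8_t *dst, int chunks)
{
    uint8_t buf[2][CHUNK];
    if (chunks <= 0)
        return;
    memcpy(buf[0], src, CHUNK);              /* prime buffer 0 */
    for (int i = 0; i < chunks; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < chunks)                  /* kick fetch of chunk i+1 */
            memcpy(buf[nxt], src + (i + 1) * CHUNK, CHUNK);
        /* (wait on chunk i's DMA tag here on real hardware) */
        for (int b = 0; b < CHUNK; ++b)      /* transform chunk i */
            dst[i * CHUNK + b] = (uint8_t)(buf[cur][b] + 1);
    }
}
```

Triple buffering adds a third slot so the outgoing write of chunk i-1 also overlaps, at the cost of more local store.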
       Optimizing for SPUs
• Code level optimization
  – Rules of thumb:
     • Know the instruction set
      • Use si intrinsics (or asm)
      • Stick with native types
         – Clue: There's only one (qword)
       Optimizing for SPUs
• Code level optimization
  – Rules of thumb:
  – Code branch free
     • Not just for branch performance.
     • Branch free scalar transforms to SIMD
       extremely well.
  – There is a hitch. No SIMD loads or
    stores.
     • This drives data design decisions.
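Branch-free scalar code follows the same pattern as the SPU's selb instruction: build an all-ones/all-zeros mask from a comparison, then blend with bitwise ops. A minimal sketch (function names are illustrative):

```c
#include <stdint.h>

/* selb-style select: mask must be all-ones or all-zeros.
 * Picks b where mask bits are set, a elsewhere. */
static uint32_t select_u32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Branch-free max: the comparison (a < b) yields 0 or 1; negating
 * it produces 0x00000000 or 0xFFFFFFFF, i.e. the select mask. */
static uint32_t max_u32(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)-(int32_t)(a < b);
    return select_u32(a, b, mask);
}
```

Because each lane of a SIMD register can carry its own mask, this scalar form translates to four-wide code with no restructuring, which is the slide's point.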
       Optimizing for SPUs
• Code level optimization
  – Examples...
       Optimizing for SPUs
• Example 1: Vector-Matrix Multiply
                     Vector-Matrix Multiplication

                          Standard Approach

Multiplying a vector (x,y,z,w) by a 4x4 matrix:

  (x’ y’ z’ w’) = (x y z w) * (m00 m01 m02 m03)
                              (m10 m11 m12 m13)
                              (m20 m21 m22 m23)
                              (m30 m31 m32 m33)

The result is obtained by multiplying x by the first row of the matrix,
y by the second, etc. and accumulating these products. This observation
leads to the standard method:

Broadcast each of x, y, z and w across all 4 components, then perform
4 multiply-add type instructions. Abbreviated versions are possible in
the special cases of w=0 and w=1, which occur frequently.

All 3 versions are shown below.

It’s a simple matter to extend this approach to the product of two 4x4
matrices. Note that the w=0 and w=1 cases come into play here when our
matrices have (0,0,0,1)T in the rightmost column.

The general case:

  shufb   xxxx, xyzw, xyzw,   shuf_AAAA
  shufb   yyyy, xyzw, xyzw,   shuf_BBBB
  shufb   zzzz, xyzw, xyzw,   shuf_CCCC
  shufb   wwww, xyzw, xyzw,   shuf_DDDD
  fm      result, xxxx, m0
  fma     result, yyyy, m1,   result
  fma     result, zzzz, m2,   result
  fma     result, wwww, m3,   result

Case w=0:

  shufb   xxxx, xyz0, xyz0,   shuf_AAAA
  shufb   yyyy, xyz0, xyz0,   shuf_BBBB
  shufb   zzzz, xyz0, xyz0,   shuf_CCCC
  fm      result, xxxx, m0
  fma     result, yyyy, m1,   result
  fma     result, zzzz, m2,   result

Case w=1:

  shufb   xxxx, xyz1, xyz1,   shuf_AAAA
  shufb   yyyy, xyz1, xyz1,   shuf_BBBB
  shufb   zzzz, xyz1, xyz1,   shuf_CCCC
  fma     result, xxxx, m0,   m3
  fma     result, yyyy, m1,   result
  fma     result, zzzz, m2,   result
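For reference, the broadcast-and-accumulate structure maps onto scalar C as follows (a host-side sketch, our naming; the loops correspond to the shufb broadcasts and the fm/fma chain above):

```c
/* Row-vector times 4x4 matrix, structured the way the SPU code is:
   broadcast one component of v at a time and accumulate with
   multiply-adds. m[r][c] is row r, column c. */
static void vec_mat_mul(float out[4], const float v[4], const float m[4][4])
{
    for (int c = 0; c < 4; ++c)        /* fm:  result = xxxx * m0 */
        out[c] = v[0] * m[0][c];
    for (int r = 1; r < 4; ++r)        /* fma: result += broadcast(v[r]) * mr */
        for (int c = 0; c < 4; ++c)
            out[c] += v[r] * m[r][c];
}
```

The w=0 and w=1 variants simply drop the last iteration (and, for w=1, fold row 3 in as the initial value).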
                   Vector-Matrix Multiplication

                            Faster Alternatives




    In the simple case where we only wish to transform a single vector,
or multiply a single pair of matrices, the standard approach that was shown
would be most appropriate. But frequently we’ll have a collection of vectors
  or matrices which we wish to multiply by the same matrix, in which case
    we may be prepared to make sacrifices for the sake of reducing the
                               instruction count.
                              Vector-Matrix Multiplication

                                                Alternative 1

           By simply preswizzling the matrix, we can reduce the number of shuffles needed

The general case:

Preswizzle the matrix as: (m00 m11 m22 m33)
                          (m10 m21 m32 m03)
                          (m20 m31 m02 m13)
                          (m30 m01 m12 m23)

then transform a vector using the sequence:

  rotqbyi yzwx, xyzw, 4
  rotqbyi zwxy, xyzw, 8
  rotqbyi wxyz, xyzw, 12
  fm      result, xyzw, m0_
  fma     result, yzwx, m1_, result
  fma     result, zwxy, m2_, result
  fma     result, wxyz, m3_, result

Case w=0, with (0,0,0,1)T in the rightmost matrix column:

Preswizzle the matrix as: (m00, m11, m22, 0)
                          (m10, m21, m02, 0)
                          (m20, m01, m12, 0)

This can be done efficiently using selb:

  fsmbi mask_0F00, 0x0F00
  fsmbi mask_00F0, 0x00F0
  selb  m0_, m0, m1, mask_0F00
  selb  m1_, m1, m2, mask_0F00
  selb  m2_, m2, m0, mask_0F00
  selb  m0_, m0_, m2, mask_00F0
  selb  m1_, m1_, m0, mask_00F0
  selb  m2_, m2_, m1, mask_00F0

The vector multiply then only requires 5 instructions:

  shufb yzx0, xyz0, xyz0, shuf_BCA0
  shufb zxy0, xyz0, xyz0, shuf_CAB0
  fm    result, xyz0, m0_
  fma   result, yzx0, m1_, result
  fma   result, zxy0, m2_, result

Case w=1, with (0,0,0,1)T in the rightmost matrix column:

Use the same preswizzle as the w=0 case, leaving row 3 unchanged.
Again 5 instructions suffice:

  shufb   yzx0, xyz0, xyz0, shuf_BCA0
  shufb   zxy0, xyz0, xyz0, shuf_CAB0
  fma     result, xyz0, m0_, m3
  fma     result, yzx0, m1_, result
  fma     result, zxy0, m2_, result
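The general-case preswizzle is easy to sanity-check on a host. In the sketch below (our naming), rotqbyi by 4/8/12 bytes becomes a rotate by 1/2/3 elements, and the diagonal preswizzle is expressed as an index formula:

```c
/* Alternative 1 on the host. The preswizzle ms[r][c] = m[(r+c)%4][c]
   yields row 0 = (m00 m11 m22 m33), row 1 = (m10 m21 m32 m03), etc.
   v rotated left by r elements stands in for rotqbyi xyzw, r*4. */
static void vec_mat_mul_preswizzled(float out[4], const float v[4],
                                    const float m[4][4])
{
    float ms[4][4];
    for (int r = 0; r < 4; ++r)        /* done once per matrix, up front */
        for (int c = 0; c < 4; ++c)
            ms[r][c] = m[(r + c) % 4][c];

    for (int c = 0; c < 4; ++c)
        out[c] = v[c] * ms[0][c];      /* fm   result, xyzw, m0_ */
    for (int r = 1; r < 4; ++r)        /* fma with yzwx, zwxy, wxyz */
        for (int c = 0; c < 4; ++c)
            out[c] += v[(c + r) % 4] * ms[r][c];
}
```

Three rotates replace four shuffles and the broadcast constants disappear, which is where the savings comes from when the same matrix is reused across many vectors.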
                              Vector-Matrix Multiplication

                                               Alternative 2

        If we’re dealing with the general case, we can reduce the instruction count further still

Using the preswizzle: (m02, m13, m20, m31)
                      (m12, m23, m30, m01)
                      (m00, m11, m22, m33)
                      (m10, m21, m32, m03)

we can carry out the vector multiply in just 6 instructions:

  rotqbyi   yzwx, xyzw, 4
  fm        temp, xyzw, m0_
  fma       temp, yzwx, m1_, temp
  rotqbyi   result, temp, 8
  fma       result, xyzw, m2_, result
  fma       result, yzwx, m3_, result

This approach yields no additional benefits for the w=0 and w=1 cases, however.
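Alternative 2 can be modeled the same way. The sketch below (our naming) folds the preswizzle into the array subscripts; the rotate-by-8-bytes of the partial sums becomes a rotate by two elements:

```c
/* Alternative 2 on the host. The preswizzled rows are
   m0_[c] = m[(c+2)%4][(c+2)%4 ... ] folded directly into subscripts:
   m0_ = (m02 m13 m20 m31), m1_ = (m12 m23 m30 m01),
   m2_ = (m00 m11 m22 m33), m3_ = (m10 m21 m32 m03). */
static void vec_mat_mul_alt2(float out[4], const float v[4],
                             const float m[4][4])
{
    float temp[4];
    /* temp = xyzw * m0_ + yzwx * m1_  (fm + fma) */
    for (int c = 0; c < 4; ++c)
        temp[c] = v[c] * m[c][(c + 2) % 4]
                + v[(c + 1) % 4] * m[(c + 1) % 4][(c + 2) % 4];
    /* result = rotqbyi temp, 8 (rotate two elements), then two fma's
       against m2_ and m3_ */
    for (int c = 0; c < 4; ++c)
        out[c] = temp[(c + 2) % 4]
               + v[c] * m[c][c]
               + v[(c + 1) % 4] * m[(c + 1) % 4][c];
}
```

Only one rotated copy of the vector is needed; the second rotate is applied to the partial sums instead, saving an instruction over Alternative 1 in the general case.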


                                               Conclusion

Single vector/matrix times a single matrix: use the Standard Approach.
    Many vectors/matrices times a single matrix: use Alternative 1.
   Many general vectors/matrices (i.e. anything in w) times a single matrix
                 in a pipelined loop: use Alternative 2.
       Optimizing for SPUs
• Example 2: Matrix Transpose
                             Matrix Transposition

                                  Standard Approach

       A general 4x4 matrix can be transposed in 8 shuffles as follows

                  (x0,     y0,   z0,   w0)    (x0, x1, x2, x3)
                  (x1,     y1,   z1,   w1) -> (y0, y1, y2, y3)
                  (x2,     y2,   z2,   w2)    (z0, z1, z2, z3)
                  (x3,     y3,   z3,   w3)    (w0, w1, w2, w3)

       shufb   t0,   a0,   a2,   shuf_AaBb     //   t0   =   (x0,   x2,   y0,   y2)
       shufb   t1,   a1,   a3,   shuf_AaBb     //   t1   =   (x1,   x3,   y1,   y3)
       shufb   t2,   a0,   a2,   shuf_CcDd     //   t2   =   (z0,   z2,   w0,   w2)
       shufb   t3,   a1,   a3,   shuf_CcDd     //   t3   =   (z1,   z3,   w1,   w3)
       shufb   b0,   t0,   t1,   shuf_AaBb     //   b0   =   (x0,   x1,   x2,   x3)
       shufb   b1,   t0,   t1,   shuf_CcDd     //   b1   =   (y0,   y1,   y2,   y3)
       shufb   b2,   t2,   t3,   shuf_AaBb     //   b2   =   (z0,   z1,   z2,   z3)
       shufb   b3,   t2,   t3,   shuf_CcDd     //   b3   =   (w0,   w1,   w2,   w3)

Many variations are possible by changing the particular shuffles used, but they all end
 up doing the same thing in the same amount of work. The version shown above is a
                 good choice because it only requires two constants.
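The merge network can be modeled on the host with a 4-element selector (entries 0–3 pick from the first source, 4–7 from the second, matching the AaBb naming above). The 8-shuffle transpose then reads:

```c
/* Host model of shufb at word granularity: sel[i] in 0..3 picks a word
   from a, 4..7 picks from b. A temp copy lets d alias a or b. */
static void shuf4(float d[4], const float a[4], const float b[4],
                  const int sel[4])
{
    float t[4];
    for (int i = 0; i < 4; ++i)
        t[i] = (sel[i] < 4) ? a[sel[i]] : b[sel[i] - 4];
    for (int i = 0; i < 4; ++i)
        d[i] = t[i];
}

/* 4x4 transpose in 8 shuffles, mirroring the SPU sequence above. */
static void transpose4x4(float b[4][4], const float a[4][4])
{
    static const int AaBb[4] = { 0, 4, 1, 5 };  /* shuf_AaBb */
    static const int CcDd[4] = { 2, 6, 3, 7 };  /* shuf_CcDd */
    float t0[4], t1[4], t2[4], t3[4];
    shuf4(t0, a[0], a[2], AaBb);   /* (x0, x2, y0, y2) */
    shuf4(t1, a[1], a[3], AaBb);   /* (x1, x3, y1, y3) */
    shuf4(t2, a[0], a[2], CcDd);   /* (z0, z2, w0, w2) */
    shuf4(t3, a[1], a[3], CcDd);   /* (z1, z3, w1, w3) */
    shuf4(b[0], t0, t1, AaBb);     /* (x0, x1, x2, x3) */
    shuf4(b[1], t0, t1, CcDd);     /* (y0, y1, y2, y3) */
    shuf4(b[2], t2, t3, AaBb);     /* (z0, z1, z2, z3) */
    shuf4(b[3], t2, t3, CcDd);     /* (w0, w1, w2, w3) */
}
```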
                        Matrix Transposition

                                 Faster 4x4

By using a different set of shuffles, a couple of the shuffles can then be
           replaced by select-bytes which has lower latency



  shufb    t0,   a0,   a1,   shuf_AaCc      //   t0   =   (x0,   x1,   z0,   z1)
  shufb    t1,   a2,   a3,   shuf_CcAa      //   t1   =   (z2,   z3,   x2,   x3)
  shufb    t2,   a0,   a1,   shuf_BbDd      //   t2   =   (y0,   y1,   w0,   w1)
  shufb    t3,   a2,   a3,   shuf_DdBb      //   t3   =   (w2,   w3,   y2,   y3)
  shufb    b2,   t0,   t1,   shuf_CDab      //   b2   =   (z0,   z1,   z2,   z3)
  shufb    b3,   t2,   t3,   shuf_CDab      //   b3   =   (w0,   w1,   w2,   w3)
  selb     b0,   t0,   t1,   mask_00FF      //   b0   =   (x0,   x1,   x2,   x3)
  selb     b1,   t2,   t3,   mask_00FF      //   b1   =   (y0,   y1,   y2,   y3)




This version is quicker by 1 cycle, at the expense of requiring more constants
                          Matrix Transposition

                                    3x4 -> 4x3


               Here is an example that uses only 6 shuffles


               (x0, y0, z0, w0)                  (x0, x1, x2, 0)
               (x1, y1, z1, w1)            ->    (y0, y1, y2, 0)
               (x2, y2, z2, w2)                  (z0, z1, z2, 0)
                                                 (w0, w1, w2, 0)

   shufb     t0,   a0,   a1,   shuf_AaBb         //   t0   =   (x0,   x1,   y0,   y1)
   shufb     t1,   a0,   a1,   shuf_CcDd         //   t1   =   (z0,   z1,   w0,   w1)
   shufb     b0,   t0,   a2,   shuf_ABa0         //   b0   =   (x0,   x1,   x2,   0)
   shufb     b1,   t0,   a2,   shuf_CDb0         //   b1   =   (y0,   y1,   y2,   0)
   shufb     b2,   t1,   a2,   shuf_ABc0         //   b2   =   (z0,   z1,   z2,   0)
   shufb     b3,   t1,   a2,   shuf_CDd0         //   b3   =   (w0,   w1,   w2,   0)


Note that care must be taken if the destination matrix is the same as the source.
              In this case the last 2 lines of code must be swapped
                        to avoid prematurely overwriting a2.
                     Matrix Transposition

                               3x3


          Here is an example that uses only 5 shuffles



        (x0, y0, z0, w0)             (x0, x1, x2, 0)
        (x1, y1, z1, w1)        ->   (y0, y1, y2, 0)
        (x2, y2, z2, w2)             (z0, z1, z2, 0)

shufb   t0,   a0,   a1, shuf_AaBb    // t0 = (x0, x1, y0, y1)
shufb   t1,   a0,   a1, shuf_CcDd    // t1 = (z0, z1, w0, w1)
shufb   b0,   t0,   a2, shuf_ABa0    // b0 = (x0, x1, x2, 0)
shufb   b1,   t0,   a2, shuf_CDb0    // b1 = (y0, y1, y2, 0)
shufb   b2,   t1,   a2, shuf_ABc0    // b2 = (z0, z1, z2, 0)
                           Matrix Transposition

                             3x3 (reduced latency)‫‏‬

If we seek the lowest latency, this example is 2 cycles quicker than the last
    example, at the expense of an extra instruction and an extra constant


              (x0, y0, z0, w0)               (x0, x1, x2, 0)
              (x1, y1, z1, w1)          ->   (y0, y1, y2, 0)
              (x2, y2, z2, w2)               (z0, z1, z2, 0)

     shufb     t0,   a1,   a2,   shuf_0Aa0     //   t0   =   ( 0,   x1,   x2,   0)
     shufb     t1,   a2,   a0,   shuf_b0B0     //   t1   =   (y0,    0,   y2,   0)
     shufb     t2,   a0,   a1,   shuf_Cc00     //   t2   =   (z0,   z1,    0,   0)
     selb      b0,   a0,   t0,   mask_0FFF     //   b0   =   (x0,   x1,   x2,   0)
     selb      b1,   a1,   t1,   mask_F0FF     //   b1   =   (y0,   y1,   y2,   0)
     selb      b2,   a2,   t2,   mask_FF0F     //   b2   =   (z0,   z1,   z2,   0)



  Hybrid versions are also possible, which may be of use when trying to balance
                              even vs. odd counts.
       Optimizing for SPUs
• Example 3: 8 bit palette lookup
  – Flip the problem around
  – Instead of looking up index for each
    byte...
  – Loop through the palette and compare
    each quadword of indices and mask any
    matching results
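In scalar terms, the flipped loop looks like the sketch below (our names). On the SPU, the inner loop handles 16 indices per iteration with a single compare (ceqb) and select (selb) on each quadword:

```c
#include <stdint.h>
#include <stddef.h>

/* Palette lookup, inverted: instead of a per-byte random access into
   the palette, loop over the palette entries and mask matching results
   into the output. On the SPU the inner loop is 16-wide (one
   ceqb + selb per quadword of indices). */
static void palette_lookup(uint32_t *out, const uint8_t *indices,
                           size_t count, const uint32_t *palette,
                           size_t palette_size)
{
    for (size_t p = 0; p < palette_size; ++p) {
        uint32_t color = palette[p];
        for (size_t i = 0; i < count; ++i) {
            /* Branch-free select: all-ones mask where the index matches. */
            uint32_t mask = (uint32_t)-(int32_t)(indices[i] == (uint8_t)p);
            out[i] = (color & mask) | (out[i] & ~mask);
        }
    }
}
```

The scalar version does more total work, but the SIMD version replaces 16 dependent loads per quadword with a handful of predictable, pipelineable compares and selects.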
       Optimizing for SPUs
• When is it better to use asm?
  – When you know facts the compiler
    cannot (and can take advantage of
    them)
  – i.e. almost always.
        Optimizing for SPUs
• When is asm really worth it?
  – Case-by-case.
     • Time, experience, performance,
       practice.
• Doesn't it make the code unmaintainable?
  – Not much different from using intrinsics.
  – Especially if you use macro-asm tools.
  – e.g. for register coloring - that's really
    the tedious part of editing asm.
       Optimizing for SPUs
• Writing asm rules-of-thumb:
  – Minimize instruction count
  – Minimize trace latency
  – (Instruction count takes precedence)
  – Balance even/odd instruction pipelines
  – Minimize memory accesses
     • Can block DMA or instruction fetch
         The 256K Barrier
• The solution is simple:
  – Upload more code when you need it.
  – Upload more data when you need it.
• Data is managed by traditional means
  – i.e. Double, triple fixed-buffers, etc.
• Code is just data.
  – Can we manage code the same way we
    manage data?
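The double-buffer pattern above can be sketched as follows. The dma_get/dma_wait functions here are hypothetical stand-ins for the MFC intrinsics (mfc_get plus tag-status waits); on a host they just memcpy, so the shape of the loop can be shown and tested:

```c
#include <stdint.h>
#include <string.h>

/* Double-buffered streaming sketch. While the current buffer is being
   processed, the next chunk's transfer is already in flight. count is
   assumed to be a multiple of CHUNK for brevity. */
enum { CHUNK = 16 };

static void dma_get(void *ls, const void *ea, size_t size, int tag)
{
    (void)tag;
    memcpy(ls, ea, size);  /* real SPU code: mfc_get(ls, ea, size, tag, 0, 0) */
}

static void dma_wait(int tag) { (void)tag; }  /* real code: wait on tag mask */

static int32_t sum_stream(const int32_t *src, size_t count)
{
    int32_t buf[2][CHUNK];
    int32_t total = 0;
    size_t chunks = count / CHUNK;
    int cur = 0;

    dma_get(buf[cur], src, sizeof buf[cur], cur);          /* prime buffer 0 */
    for (size_t i = 0; i < chunks; ++i) {
        int next = cur ^ 1;
        if (i + 1 < chunks)                                /* kick off next */
            dma_get(buf[next], src + (i + 1) * CHUNK, sizeof buf[next], next);
        dma_wait(cur);                                     /* wait only on cur */
        for (int j = 0; j < CHUNK; ++j)
            total += buf[cur][j];
        cur = next;
    }
    return total;
}
```

Triple buffering extends the same idea so that input, processing, and output transfers all overlap.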
             SPU Shaders
• SPU Shaders are:
  – Fragments of code used in existing systems
    (Physics, Animation, Effects, AI, etc.)
  – Code is loaded at location pre-determined by
    system.
  – Custom (Data/Interface) for each system.
  – An expansion of an existing system (e.g.
    Pipelined stages)
  – Custom modifications of system data.
  – Way of delivering feedback to other systems
    outside the scope of the current system.
             SPU Shaders
• SPU Shaders are NOT:
  – Generic, general purpose system.
  – A system of any kind, actually.
  – Globally scheduled.
                 SPU Shaders
• Why is it called a “shader”?
  – Shares important similarities to GPU shaders.
     •   Native code fragments
     •   Part of a larger system
     •   In-context execution
     •   Independently optimizable
  – Most important: Concept is approachable.
              SPU Shaders
• “Don't try to solve everyone's problems”
  – Solutions that try to solve all problems tend to
    cause more problems than they solve.
             SPU Shaders
• Easy to Implement
  – Pick stage(s) in system kernel to inject
    shaders.
  – Define available inputs and outputs.
  – Collect common functions.
  – Compile shaders as data.
  – Sort instance data based on shader type(s)
  – Load shader on-demand based on data
    select.
  – Call shaders.
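The steps above can be sketched as a host-side analogue, where "loading" a shader is just selecting a function pointer (on the SPU it would be a DMA of the fragment into the fixed shader buffer). All names here are illustrative:

```c
#include <stddef.h>

/* Host-side analogue of the dispatch steps: instances are sorted by
   shader type, so each shader is "loaded" once per run of instances
   rather than once per instance. */
typedef void (*ShaderEntry)(float *instance);

typedef struct {
    int   shader_type;
    float value;
} Instance;

static void shader_double(float *v) { *v *= 2.0f; }
static void shader_negate(float *v) { *v = -*v; }

static const ShaderEntry shader_table[] = { shader_double, shader_negate };

static void dispatch(Instance *instances, size_t count)
{
    size_t i = 0;
    while (i < count) {
        int type = instances[i].shader_type;
        ShaderEntry shader = shader_table[type];  /* load on demand */
        while (i < count && instances[i].shader_type == type)
            shader(&instances[i++].value);        /* call per instance */
    }
}
```

Sorting first is what makes the on-demand load cheap: one transfer amortized across a whole batch.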
             SPU Shaders
• What data is being transformed?
  – What are the inputs?
  – What are the outputs?
  – What can be modified?
              SPU Shaders
• Collect the common functions...
  – Always loaded by the system
  – e.g.
     • Dma wrapper functions
     • Debugging functions
     • Common transformation functions
               Example Structure Passed to Shader




struct   common_t
{
  void   (*print_str)(const char *str);
  void   (*dma_wait)(uint32_t tag);
  void   (*dma_send)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
  void   (*dma_recv)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);

  char*      ls;
  uint32_t   ls_size;
  uint32_t   data_ea;
  uint32_t   data_size;
  uint32_t   dma_tags[2];
};
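A fragment written against an interface like this might look as follows (our illustrative example, not Insomniac's actual code). Note it touches the outside world only through the function pointers and the buffer it was given; the stub DMA functions here treat effective addresses as offsets into a fake main-memory array so the shape can be shown and tested on a host:

```c
#include <stdint.h>
#include <string.h>

/* common_t as defined above. */
struct common_t {
    void (*print_str)(const char *str);
    void (*dma_wait)(uint32_t tag);
    void (*dma_send)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    void (*dma_recv)(void *ls, uint32_t ea, uint32_t size, uint32_t tag);
    char    *ls;
    uint32_t ls_size;
    uint32_t data_ea;
    uint32_t data_size;
    uint32_t dma_tags[2];
};

/* Host stand-ins for testing: "effective addresses" index a fake
   main-memory array instead of going through the MFC. */
static char fake_main_ram[256];
static void stub_wait(uint32_t tag) { (void)tag; }
static void stub_send(void *ls, uint32_t ea, uint32_t size, uint32_t tag)
{ (void)tag; memcpy(fake_main_ram + ea, ls, size); }
static void stub_recv(void *ls, uint32_t ea, uint32_t size, uint32_t tag)
{ (void)tag; memcpy(ls, fake_main_ram + ea, size); }

/* Hypothetical shader fragment: pull the system's data into the scratch
   buffer it was handed, transform in place, and send it back, using
   only the tags and helpers passed down in common_t. */
static void scale_positions_shader(struct common_t *c, float scale)
{
    uint32_t tag = c->dma_tags[0];
    float   *pos = (float *)c->ls;

    c->dma_recv(c->ls, c->data_ea, c->data_size, tag);
    c->dma_wait(tag);
    for (uint32_t i = 0; i < c->data_size / sizeof(float); ++i)
        pos[i] *= scale;
    c->dma_send(c->ls, c->data_ea, c->data_size, tag);
    c->dma_wait(tag);
}
```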
             SPU Shaders
• System Shader Configuration...
  – System knows where the fragments are.
  – System knows when to call the fragments.
  – System doesn't know what the fragments do.
  – Fragments are in main RAM.
  – Fragments don't need to be fixed.
                 SPU Shaders
• System Shader Configuration.
• Manage fragment memory:
  – Simplest method:
    •   Double buffer,
    •   On-demand,
    •   Fixed maximum size,
    •   By-index from array,...
             SPU Shaders
• Create the shader code...
• “Code is just data”
  – No special distinguishing feature on the SPUs
• Overlays or additional jobs are too
  complex and heavyweight.
  – Just want load and execute.
  – No special system needed.
                SPU Shaders
• Create the shader code...
  – Method 1: Shader as PPU header
    •   Compile shader as normal, to obj file.
    •   Dump obj file using spu-objdump
    •   Convert dump to header using script.
    •   This is what we started with
               SPU Shaders
• Create the shader code...
  – Method 2: Use elf file
     • Requires extra compile step, but more debugger
       friendly.
     • This is what we're doing now.


  – Other methods too, use whatever works for
    you.
            SPU Shaders
• Calling the shader...
• Nothing could be easier.
  – ShaderEntry* shader = (addr of
    fragment);
  – shader( data, common );
              SPU Shaders
• Debugging Shaders...
  – Fragments are small
  – Fragments have well defined inputs and
    outputs.
  – Ideal for unit tests in separate framework.
  – Test on PS3/Linux box.
• Alternatives:
  – Debug on PPU (intrinsics are portable)
  – Temporarily link in shader.
             SPU Shaders
• Runtime debugging:
  – Is a problem with the first method.
  – Using the full elf, have debugging info
  – Now works transparently in our debugger.
              SPU Shaders
• Rule 1: Don't Manage Data for Shaders
  – Just give shaders a buffer and fixed size.
  – Shaders should depend on size, so leave
    room for system changes.
  – Best size depends on system.
     • (Maybe 4K, maybe 32K)
  – Don't read or write from/to shader buffer.
              SPU Shaders
• System-specific
  – Multiple list of instances to modify or
    transform
  – Context data
• Shader-internal (“local”)
  – EA passed by system
  – Fixed buffer
• Shader shared (“global”)
  – EA passed by system
              SPU Shaders
• Rule 2: Don't Manage DMA for Shaders
  – Give fixed number of DMA tags to shader
    • Grab them in the entry function and pass down
    • Avoid: GetDmaTagFromParentSystem()
  – Give DMA functions to shaders
    • To allow system to run with any job manager, or
      none
  – Don't use shader tags for other purposes
              SPU Shaders
• Rule 3: Enforce fixed maximum size for
  Shader code.
  – System can be maintained.

• Rule 4: Shaders are always called in a
  clear, well defined context.
  – i.e. Part of a larger system.
             SPU Shaders
• Rule 5: Fixed parameter list for shaders,
  per-system (or sub-system)
  – Don't want to re-compile all shaders.
  – Don't want to manage dynamic parameter
    lists.

• Rule 6: Shaders should be given as many
  instances as possible.
  – More optimizable.
             SPU Shaders
• Rule 7: Don't break the rules.
  – You'll end up with a new job manager.
  – You'll end up with a big headache.
             SPU Shaders
• Where are we using these?
  – Physics, Effects, Animation, Some AI Update
• Also experimenting with pre-vertex
  shaders on the SPUs
• And experimenting with giving some of
  that control to the artists (Directly
  generating code from a tool...)
               Conclusion
• It's not that complicated.
• Good data and good design works well on
  the SPUs (and will work well anywhere)
  – Sometimes you can get away with bad design
    and bad data on other platforms
  – ...for now. Bad design will not survive this
    generation.
• Lots of opportunities for optimization.
                Credits
• This was based on the hard work and
  dedication of the Insomniac Tech Team.
  You guys are awesome.
