Emergent Game Technologies
Gamebryo Element Engine
Thread for Performance
Goals for Cross-Platform Threading

•Play well with others
•Take advantage of platform-specific
 performance features
•For engines/middleware, be adaptable to
 the needs of customers




                                           2
Write Once, Use Everywhere

•Underlying multi-threaded primitives are
 replicated on all platforms
 – Define cross-platform wrappers for these
•Processing models can be applied on
 different architectures
 – Define cross-platform systems for these
•Typical developer writes once, yet code
 performs well on all platforms



                                              3
Emergent's Gamebryo Element

•A foundation for easing cross-platform and
 multi-core development
 –Modular, customizable
 –Suite of content pipeline tools
 –Supports PC, Xbox, PS3 and Wii
•Booth # 5716 - North Hall




                                              4
Cross-Platform Threading Requires
Common Primitives
•Threads
 – Something that executes code
 – Sub issues: local storage, priorities
•Data Locks / Critical sections
 – Manage contention for a resource
•Atomic operations
 – An operation that is guaranteed to complete
   without interruption from another thread
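
A minimal sketch of what wrappers for these three primitives might look like, assuming the C++ standard library as the portable backend. The names `Thread`, `CriticalSection`, `AtomicCounter`, and `RunPrimitiveDemo` are hypothetical and are not Gamebryo APIs; an engine would likely map each wrapper onto the platform's native primitive instead.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical wrapper names for illustration; not part of Gamebryo.

// Thread: something that executes code. Joins automatically on destruction.
class Thread {
public:
    explicit Thread(std::function<void()> fn) : m_thread(std::move(fn)) {}
    ~Thread() { if (m_thread.joinable()) m_thread.join(); }
    Thread(const Thread&) = delete;
    Thread& operator=(const Thread&) = delete;
private:
    std::thread m_thread;
};

// Critical section: manages contention for a resource.
class CriticalSection {
public:
    void Lock() { m_mutex.lock(); }
    void Unlock() { m_mutex.unlock(); }
private:
    std::mutex m_mutex;
};

// Atomic operation: completes without interruption from another thread.
class AtomicCounter {
public:
    std::uint32_t Increment() { return m_value.fetch_add(1) + 1; }
    std::uint32_t Get() const { return m_value.load(); }
private:
    std::atomic<std::uint32_t> m_value{0};
};

// Starts threadCount threads that each bump a lock-guarded counter and an
// atomic counter `iterations` times. Returns true if both totals are correct.
bool RunPrimitiveDemo(unsigned threadCount, unsigned iterations) {
    CriticalSection lock;
    std::uint32_t guarded = 0;  // only touched while `lock` is held
    AtomicCounter counter;      // safe to touch without a lock
    {
        std::vector<std::unique_ptr<Thread>> threads;
        for (unsigned t = 0; t < threadCount; ++t) {
            threads.push_back(std::make_unique<Thread>([&] {
                for (unsigned i = 0; i < iterations; ++i) {
                    lock.Lock();
                    ++guarded;
                    lock.Unlock();
                    counter.Increment();
                }
            }));
        }
    }  // Thread destructors join every worker here.
    const std::uint32_t expected = threadCount * iterations;
    return guarded == expected && counter.Get() == expected;
}
```

Calling `RunPrimitiveDemo(4, 1000)` exercises all three wrappers at once; engine code above this layer would never need to name a platform API directly.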




                                                 5
Choosing a Processing Model

•Architectural features drive choice
 – Cache coherence
 – Prefetch on Xbox
 – SPUs on PS3
 – Many processing units
 – General purpose GPU
•Stream Processing fits these properties
 – Provide infrastructure to compute this way
 – Shift engine work to this model



                                                6
Stream Processing (Formal)
Wikipedia: Given a set of input and output
data (streams), the principle essentially
defines a series of compute-intensive
operations (kernel functions) to be applied
for each element in the stream.
[Diagram: Input 1 feeds Kernel 1; Kernel 1 and Input 2 feed Kernel 2, which produces Output.]


                                                    7
Generalized Stream Processing

•Improve for general purpose computing
 – Partition streams into chunks
 – Kernels have access to entire chunk
 – Parameters for kernels (fixed inputs)
•Advantages
 – Reduce need for strict data locality
 – Enables loops, non-SIMD processing
 – Maps better onto hardware
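
A minimal sketch of the chunked model described above, in plain C++. The `Chunk` and `ScaleParams` types and the `ProcessInChunks` driver are hypothetical names for illustration, not Floodgate types. The kernel receives a whole chunk plus a set of fixed parameters, so it can loop over the data freely.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; not the Floodgate API.

// A chunk is a contiguous slice of the input and output streams.
struct Chunk {
    const float* input;
    float* output;
    std::size_t count;
};

// Fixed inputs: the same values are handed to every kernel invocation.
struct ScaleParams {
    float scale;
    float bias;
};

// The kernel has access to the entire chunk, so an ordinary loop is fine.
void ScaleBiasKernel(const Chunk& chunk, const ScaleParams& params) {
    for (std::size_t i = 0; i < chunk.count; ++i) {
        chunk.output[i] = chunk.input[i] * params.scale + params.bias;
    }
}

// Partitions the stream into chunks of at most chunkSize elements and runs
// the kernel once per chunk. Returns the number of chunks processed.
// Chunks do not overlap, so each call could run on a different core.
std::size_t ProcessInChunks(const std::vector<float>& input,
                            std::vector<float>& output,
                            std::size_t chunkSize,
                            const ScaleParams& params) {
    if (chunkSize == 0) return 0;
    output.resize(input.size());
    std::size_t chunksRun = 0;
    for (std::size_t begin = 0; begin < input.size(); begin += chunkSize) {
        const std::size_t count = std::min(chunkSize, input.size() - begin);
        ScaleBiasKernel({input.data() + begin, output.data() + begin, count},
                        params);
        ++chunksRun;
    }
    return chunksRun;
}
```

The driver here runs the chunks one after another; the point is only that the kernel never needs to know how the stream was split.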




                                            8
Morphing+Skinning Example


[Diagram: the Morph Kernel (MK) reads Morph Weights, Morph Target 1 Vertices, and Morph Target 2 Vertices. The Skinning Kernel (SK) reads Skin Vertices, Bone Matrices, and Blend Weights, and writes Vertex Locations.]
                                                                                                               9
Morphing+Skinning Example


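[Diagram: the streams are split into two parts.
 MK Instance 1 reads MT 1 V Part 1 and MT 2 V Part 1; MK Instance 2 reads MT 1 V Part 2 and MT 2 V Part 2. Both share the fixed input MW.
 SK Instance 1 reads Skin V Part 1 and SK Instance 2 reads Skin V Part 2. Both share the fixed Weights and Matrices.
 Outputs: Verts Part 1 and Verts Part 2.]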
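
The morph half of this partitioned example can be sketched in plain C++. This is an illustration under assumptions, not engine code: it assumes morphing is a weighted sum of the two targets, treats vertex data as a flat array of floats, and uses the hypothetical names `MorphWeights`, `MorphKernel`, and `MorphInTwoParts`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed input shared by every kernel instance (the "MW Fixed" box).
struct MorphWeights {
    float w1;
    float w2;
};

// One kernel instance: blends one part of both morph targets.
// Assumption: morphing is a weighted sum of the target values.
void MorphKernel(const float* target1, const float* target2, float* out,
                 std::size_t count, const MorphWeights& weights) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = weights.w1 * target1[i] + weights.w2 * target2[i];
    }
}

// Splits both targets into two parts and runs one kernel instance per part.
// The two calls touch disjoint ranges, so they could execute in parallel.
void MorphInTwoParts(const std::vector<float>& target1,
                     const std::vector<float>& target2,
                     const MorphWeights& weights,
                     std::vector<float>& out) {
    const std::size_t total = std::min(target1.size(), target2.size());
    out.resize(total);
    const std::size_t half = total / 2;
    // Instance 1: part 1 of each target.
    MorphKernel(target1.data(), target2.data(), out.data(), half, weights);
    // Instance 2: part 2 of each target.
    MorphKernel(target1.data() + half, target2.data() + half,
                out.data() + half, total - half, weights);
}
```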




                                                                        10
Floodgate

•Cross platform stream processing library
•Optimized per-platform implementation
•Documented API for customer use
•Engine uses the same API for built in
 functionality
 – Skinning, Morphing, Particles, Instance Culling,
   ...




                                                      11
Floodgate Basics

•Stream: A buffer of varying or fixed data
 – A pointer, length, stride, locking
•Kernel: An operation to perform on streams
 of data
 – Code implementing “Execute” function
•Task: Wraps a kernel and its I/O streams
•Workflow: A collection of Tasks processed
 as a unit
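
How the four concepts relate can be sketched with a few plain C++ types. Everything here (`Stream`, `Kernel`, `Task`, `Workflow`, `DoubleKernel`) is a hypothetical stand-in written for this illustration and is not the Floodgate API; locking is left out, and the example kernel assumes tightly packed floats.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types mirroring the four concepts above; not the Floodgate API.

// Stream: a buffer described by a pointer, a length, and a stride.
struct Stream {
    void* data;
    std::size_t blockCount;  // number of elements in the buffer
    std::size_t stride;      // bytes between consecutive elements
};

struct Task;

// Kernel: code implementing an Execute function.
struct Kernel {
    virtual ~Kernel() = default;
    virtual void Execute(const Task& task) = 0;
};

// Task: wraps a kernel together with its input and output streams.
struct Task {
    Kernel* kernel = nullptr;
    std::vector<Stream> inputs;
    std::vector<Stream> outputs;
};

// Workflow: a collection of tasks processed as a unit.
struct Workflow {
    std::vector<Task> tasks;

    // Serial reference execution. A scheduler could instead run
    // independent tasks on different cores.
    void Run() {
        for (Task& task : tasks) {
            if (task.kernel) task.kernel->Execute(task);
        }
    }
};

// Example kernel: writes twice each float of input 0 into output 0.
// Assumes both streams hold tightly packed floats of equal length.
struct DoubleKernel : Kernel {
    void Execute(const Task& task) override {
        const float* in = static_cast<const float*>(task.inputs[0].data);
        float* out = static_cast<float*>(task.outputs[0].data);
        for (std::size_t i = 0; i < task.inputs[0].blockCount; ++i) {
            out[i] = in[i] * 2.0f;
        }
    }
};
```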



                                             12
Kernel Example: Times2


     // Include Kernel Definition macros
     #include <NiSPKernelMacros.h>

     // Declare the Times2Kernel
     NiSPDeclareKernel(Times2Kernel)




                                           13
Kernel Example: Times2
#include "Times2Kernel.h"
NiSPBeginKernelImpl(Times2Kernel)
{
    // Get the input stream
    float *pInput = kWorkload.GetInput<float>(0);
    // Get the output stream
    float *pOutput = kWorkload.GetOutput<float>(0);
    // Process data
    NiUInt32 uiBlockCount = kWorkload.GetBlockCount();
    for (NiUInt32 ui = 0; ui < uiBlockCount; ui++)
    {
        pOutput[ui] = pInput[ui] * 2;
    }
}
NiSPEndKernelImpl(Times2Kernel)

                                                    14
Life of a Workflow

•1. Obtain Workflow from Floodgate
•2. Add Task(s) to Workflow
•3. Set Kernel
•4. Add Input Streams
•5. Add Output Streams
•6. Submit Workflow
•… Do something else …
•7. Wait or Poll when results are needed


                                           15
Example Workflow
// Setup input and output streams from existing buffers
NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS);
NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS);

// Get a Workflow and setup a new task for it
NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow();
NiSPTask* pTask = pWorkflow->AddNewTask();

// Set the kernel and streams
pTask->SetKernel(&Times2Kernel);
pTask->AddInput(&inputStream);
pTask->AddOutput(&outputStream);

// Submit workflow for execution
NiStreamProcessor::Get()->Submit(pWorkflow);

// Do other operations...

// Wait for workflow to complete
NiStreamProcessor::Get()->Wait(pWorkflow);
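
The same submit, do other work, then wait-or-poll pattern can be tried without the engine. The sketch below is a portable analogy built on `std::async`, not Floodgate code; `SubmitTimes2` and `IsDone` are hypothetical helper names.

```cpp
#include <chrono>
#include <future>
#include <utility>
#include <vector>

// Portable analogy of submit / poll / wait using std::async.
// Hypothetical helpers; not the Floodgate API.

// "Submit": starts doubling the input on another thread and returns a
// handle to the eventual result.
std::future<std::vector<float>> SubmitTimes2(std::vector<float> input) {
    return std::async(std::launch::async,
                      [data = std::move(input)]() mutable {
                          for (float& v : data) v *= 2.0f;
                          return std::move(data);
                      });
}

// "Poll": returns true once the result is ready, without blocking.
bool IsDone(const std::future<std::vector<float>>& handle) {
    return handle.wait_for(std::chrono::seconds(0)) ==
           std::future_status::ready;
}

// "Wait": blocks until the work has finished and returns the result.
std::vector<float> WaitForResult(std::future<std::vector<float>>& handle) {
    return handle.get();
}
```

A caller would invoke `SubmitTimes2`, carry on with unrelated work, check `IsDone` if it wants to avoid stalling, and call `WaitForResult` at the point where the data is actually needed.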
                                                                 16
Floodgate Internals

•Partitioning streams for Tasks
•Task Dependency Analysis
•Platform specific Workflow preparation
•Platform specific execution
•Platform specific synchronization




                                          17
Overview of Workflow Analysis

•Task dependencies defined by streams
•Sort tasks into stages of execution
 – Tasks that use results from other tasks run in
   later stages
 – Stage N+1 tasks depend on output of Stage N
   tasks
•Tasks in a given stage can run concurrently
•Once a stage has completed, the next stage
 can run
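
One way to sort tasks into stages from their stream dependencies is sketched below. This is a simplified illustration, not Floodgate's actual analysis: `TaskDesc` and `AssignStages` are hypothetical names, streams are identified by string, and the sketch assumes tasks are listed with producers before consumers and that the graph has no cycles.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical task description: names of the streams read and written.
struct TaskDesc {
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

// Assigns a stage to each task. A task that reads a stream written by
// another task is placed one stage after that producer; a task with no
// producing predecessor goes in stage 0.
// Assumes producers appear before consumers and there are no cycles.
std::vector<int> AssignStages(const std::vector<TaskDesc>& tasks) {
    std::map<std::string, int> producerStage;  // stream -> stage of writer
    std::vector<int> stages(tasks.size(), 0);
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        int stage = 0;
        for (const std::string& in : tasks[i].inputs) {
            auto it = producerStage.find(in);
            if (it != producerStage.end()) {
                stage = std::max(stage, it->second + 1);
            }
        }
        stages[i] = stage;
        for (const std::string& out : tasks[i].outputs) {
            producerStage[out] = stage;
        }
    }
    return stages;
}
```

All tasks that receive the same stage number are independent of one another, which is what allows them to run concurrently.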


                                                    18
Analysis: Workflow with many
Tasks

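[Diagram: a workflow of seven tasks and the streams they read and write.]
Task 1: Stream A -> Stream B
Task 2: Stream C -> Stream D
Task 3: Stream E -> Stream F
Task 4: (inputs not shown) -> Stream G
Task 5: Stream G -> Stream H
Task 6: (inputs not shown) -> Stream I
Task 7: Sync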




                                                                                     19
Analysis: Dependency Graph

Stage 0: Task 1 (reads Stream A), Task 2 (reads Stream C), Task 3 (reads Stream E, writes Stream F)
Stage 1: Task 4 (writes Stream G)
Stage 2: Task 5 (reads Stream G, writes Stream H), Task 6 (writes Stream I)
Stage 3: Sync Task




                                                                              20
Performance Notes

•Data is broken into blocks -> Locality
 – Good cache performance
 – Optimize size for prefetch or DMA transfers
 – Fits in limited local storage (PS3)
•Adapts easily to the number of cores
 – Can manage interplay with other systems
•Kernels encapsulate processing
 – Good target for optimization, platform-specific
 – Clean solution without #if
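
As a rough illustration of the sizing decisions above, the sketch below picks a block size that fits a local storage budget and a worker count based on the reported core count. The helper names are hypothetical, and any budget passed in is an assumption of the caller, not a figure from Floodgate.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helpers for illustration; not the Floodgate API.

// How many elements fit in one block if the block's input and output
// must both fit in localStoreBytes.
std::size_t ElementsPerBlock(std::size_t localStoreBytes,
                             std::size_t inputStride,
                             std::size_t outputStride) {
    const std::size_t bytesPerElement = inputStride + outputStride;
    if (bytesPerElement == 0) return 0;
    return localStoreBytes / bytesPerElement;
}

// How many blocks are needed to cover totalElements (rounded up).
std::size_t BlockCount(std::size_t totalElements, std::size_t perBlock) {
    if (perBlock == 0) return 0;
    return (totalElements + perBlock - 1) / perBlock;
}

// One worker per reported core, and at least one, because
// hardware_concurrency() may return 0 when the count is unknown.
unsigned WorkerCount() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```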



                                                     21
Usability Notes

•Automatically manage data dependency and
 simplify synchronization
•Hide nasty platform-specific details
 – Prefetch, DMA transfers, processor detection, ...
•Learn one API, use it across platforms
 – Productivity gains
    –Helps us produce quality documentation and
     samples
 – Eases debugging



                                                   22
Exploiting Floodgate in the Engine
•Find tasks that operate on a single object
  – Skinning, morphing, particle systems, ...
• Move these to Floodgate: Mesh Modifiers
  – Launch at some point during execution
     –After updating animation and bounds
     –After determining visibility
     –After physics finishes ...
  – Finish them when needed
     –Culling
     –Render
     –etc
                                                23
Same applications, new
performance ...
                     Before   After
  Skinning Objects   42fps    62fps

  Morphing Objects   12fps    38fps

•The big win is out-of-the-box
 performance
 – Same results could be achieved
   with much developer time
 – Hides details on different
   platforms (esp. PS3)


                                      24
Example CPU Utilization, Morphing
[Figure: CPU utilization charts, Before and After]




                                25
Thread profiling, Morphing Before




•Some parallelization through hand-coded
 parallel update
 – Note the high overhead and roughly 85% of
   time spent in serial execution



                                                26
Thread profiling, Morphing After




•Automatic parallelism in engine
 –4 threads for Floodgate (4 CPUs)
 –Roughly, 50% of old serial time replaced
  with 4x parallelism


                                             27
New Issues

•Within the engine, resource usage peaks at
 certain times
 – e.g. Between visibility culling and rendering
 – Application-level work might fill in the empty
   spaces
    –Physics, global illumination, ...
•What about single processor machines?
•What about variable sized output?
 – Instance culling, for example



                                                    28
Ongoing Improvements

•Improved workflow scheduling
 – Mechanisms to enhance application control
•Optimizing when tasks change
 – Stream lengths change
 – Inputs/outputs are changed
•More platform specific improvements
•Off-loading more engine work




                                               29
Using Floodgate in a game

•Identify stream processing opportunities
 – Places where lots of data is processed with local
   access patterns
 – Places where work can be prepared early but
   results are not needed until later
•Re-factor to use Floodgate
 – Depending on task, could be as little as a few
   hours.
 – Hard part is enforcing locality




                                                    30
Future proofed?

•Both CPUs and GPUs can function as stream
 processors
•Easily extends to more processing units
•Potential snags are in application changes




                                         31
Questions?

•Ask Stephen!
•Visit Emergent's booth at the show.
 – Booth 5716, North Hall, opposite Intel on the
   central aisle




                                                   32

						