CUDA LDI Project Report
Efficient collision detection is a fundamental problem in physically-based simulation
and computer animation. To accelerate collision detection for rigid bodies, many
approaches based on pre-computed bounding-volume hierarchies have been proposed.
For deformable objects, however, these hierarchical data structures cannot be
pre-computed but have to be updated frequently, which makes their computation
prohibitive in dynamic environments. Recently, image-space techniques have been
proposed for collision detection. These approaches process projections of objects,
require no pre-processing, and can employ graphics hardware. They are therefore
especially appropriate for dynamic environments.
Layered Depth Images (LDI), an image-space technique for collision detection between
arbitrarily shaped, deforming objects, are presented in recent papers by Heidelberger
et al. and Faure et al. An LDI is a view of the scene from a single input camera, but
with multiple pixels along each line of sight. A layered depth pixel stores a set of
depth pixels along one line of sight, sorted in front-to-back order. The first element
in the layered depth pixel samples the first surface seen along that line of sight, the
next element samples the next surface seen along that line of sight, and so on.
The LDI approach detects collisions and self-collisions of 3D objects with manifold
geometry. Although the approach is not confined to triangular meshes, a watertight
object surface is required in order to perform volumetric collision queries. The
algorithm computes an approximate volumetric representation of the object. This
representation is used for three different collision queries. The algorithm proceeds in
three stages:
2.1.1. Stage 1: Computes the Volume-of-Interest (VoI)
The VoI is an axis-aligned bounding box (AABB) representing the volume in which
collision queries are performed. For a self-collision test, the VoI is chosen to be the
AABB of the object. For a collision test between a pair of objects, the VoI is the
intersection of the two objects' AABBs. If the VoI is not empty, the objects and the
VoI are further processed in stage 2.
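As a minimal sketch of this stage (struct and function names are ours, not from the report), the VoI for a pair test can be computed as the per-axis intersection of the two AABBs:

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned bounding box over three axes.
struct AABB {
    float min[3];
    float max[3];
};

// Stage 1 sketch: intersect two AABBs; the result is the VoI.
// Returns false when the boxes are disjoint (no collision possible).
bool computeVoI(const AABB& a, const AABB& b, AABB& voi) {
    for (int axis = 0; axis < 3; ++axis) {
        voi.min[axis] = std::max(a.min[axis], b.min[axis]);
        voi.max[axis] = std::min(a.max[axis], b.max[axis]);
        if (voi.min[axis] > voi.max[axis])
            return false;  // separated on this axis: empty VoI
    }
    return true;
}
```

An empty result lets the algorithm exit before any LDI is generated, which is why this stage costs almost nothing.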
2.1.2. Stage 2: Computes an LDI for objects inside the VoI.
Note that LDI generation is restricted to the VoI; object primitives outside the VoI
are discarded. The LDI consists of images, or layers, of depth values representing object
surfaces. The depth values of the LDI can be interpreted as intersections of parallel rays
or 3D scan lines entering or exiting the object. Thus, an LDI classifies the VoI into inside
and outside regions with respect to an object. Additional information on face
orientation is stored in the LDI, so depth and front-face / back-face classification are
known for each entry in the LDI data structure. Stage 2 results in an LDI with sorted
depth values and explicitly labeled entry (front-face) and exit (back-face) points within
the VoI.
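The per-pixel bookkeeping of this stage can be sketched as follows (a CPU sketch with illustrative names; the actual implementation runs on the GPU): each LDI entry pairs a depth with its face orientation, and entries are kept sorted by depth.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One LDI entry along a ray: depth plus face orientation.
// frontFace == true marks an entry point, false an exit point.
struct LdiEntry {
    float depth;
    bool  frontFace;
};

// Insert a new fragment into one layered depth pixel, keeping the
// entries sorted front to back so stage 3 can scan them in ray order.
void insertSorted(std::vector<LdiEntry>& pixel, LdiEntry e) {
    auto pos = std::lower_bound(
        pixel.begin(), pixel.end(), e,
        [](const LdiEntry& a, const LdiEntry& b) { return a.depth < b.depth; });
    pixel.insert(pos, e);
}
```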
2.1.3. Stage 3: Performs one of these possible collision queries.
a) Self-collisions are detected by analyzing the order of entry (front-face) and exit
(back-face) points within the LDI. If they alternate correctly, there is no self-collision.
If invalid sequences of front faces and back faces are detected, the operation provides
an explicit representation of the self-intersection volume.
b) Collisions between pairs of objects are detected by combining their LDIs using a
Boolean intersection. If the intersection of all inside regions is empty, no collision is
detected. Otherwise, the operation provides an explicit representation of the
intersection volume.
c) Individual vertices are tested against the volume of the object. Each vertex is
transformed into the local coordinate system of the LDI. If a transformed vertex lies
within an inside region, a collision is detected.
Figure 1 The LDI process. (a) AABB computations, (b) rasterization of each object, (c) volume
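The alternation test of query (a) can be sketched for a single layered depth pixel (names are illustrative): a watertight, non-self-intersecting surface must yield a front face first, then strictly alternating back and front faces, ending with a back face.

```cpp
#include <cassert>
#include <vector>

// One LDI entry along a ray: depth plus face orientation.
struct LdiEntry {
    float depth;
    bool  frontFace;
};

// Query (a) sketch: returns true when the sorted entries of one pixel
// alternate correctly (front, back, front, back, ...), i.e. no
// self-collision is detected along this ray.
bool alternatesCorrectly(const std::vector<LdiEntry>& pixel) {
    bool expectFront = true;
    for (const LdiEntry& e : pixel) {
        if (e.frontFace != expectFront)
            return false;          // invalid sequence: self-intersection
        expectFront = !expectFront;
    }
    return expectFront;            // the ray must exit as often as it enters
}
```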
2.2. Depth Peeling
Depth Peeling is the underlying technique that makes order-independent transparency
possible. The standard depth test gives us the nearest fragment at each pixel, but there
are also fragments that are second nearest, third nearest, and so on. Standard depth
testing gives us the nearest fragment without imposing any ordering restrictions;
however, it does not give us any straightforward way to render the second-nearest or
nth-nearest surface.
Depth peeling solves this problem. The essence of the technique is that with n passes
over a scene, we can get n layers deeper into the scene. For example, with 2 passes over
the scene, we can extract the nearest and second-nearest surfaces. We get both the
depth and color (RGBA) information for each layer.
Since what we are concerned with is the depth value of each pixel, the depth peeling
algorithm can help us resolve the depth values of the fragments on the object surfaces
in a scene.
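The idea can be simulated on the CPU for a single pixel (a sketch with illustrative names; the GPU version uses the previous pass's depth buffer as a second depth test): each pass keeps the nearest fragment whose depth is strictly greater than the depth peeled in the previous pass.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// CPU sketch of depth peeling for one pixel: 'fragments' holds the
// unsorted depths of all fragments falling on this pixel; each loop
// iteration corresponds to one rendering pass and peels one layer.
std::vector<float> peelDepths(const std::vector<float>& fragments) {
    std::vector<float> layers;
    float prev = -std::numeric_limits<float>::infinity();
    for (;;) {
        float best = std::numeric_limits<float>::infinity();
        for (float d : fragments)
            if (d > prev && d < best)
                best = d;          // nearest fragment behind the last layer
        if (best == std::numeric_limits<float>::infinity())
            break;                 // nothing left to peel
        layers.push_back(best);
        prev = best;
    }
    return layers;
}
```

Note that each layer costs a full pass over all fragments, which is exactly the inefficiency the CUDA approach below avoids.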
Figure 2 These images illustrate simple depth peeling. Layer 0 shows the nearest depth,
layer 1 shows the second-nearest depth, and so on.
2.3. CUDA based LDI generation
The VoI computation in stage 1 and the collision queries in stage 3 do not significantly
contribute to the computational cost of the algorithm; the LDI generation in stage 2
is comparatively expensive. CUDA provides more flexible control over GPU memory,
so we can capture multiple fragments in a single pass. With a CUDA rasterizer, many
graphics applications can benefit from this free control of GPU memory, especially for
multi-fragment effects. F. Liu et al. present two efficient schemes to capture and sort
multiple fragments per pixel in a single geometry pass via the atomic operations of
CUDA, without read-modify-write (RMW) hazards. Experimental results show a
significant speedup over classical depth peeling, especially for large scenes.
3. Algorithm Overview
The diagram below gives an overview of the CUDA LDI generation algorithm.
Figure 3 Algorithm Overview
Our CUDA rasterizer is designed by packing the triangles into a texture and passing it
to a CUDA kernel, with each thread projecting a single triangle onto the screen and
rasterizing it by the scan-line algorithm. At each pixel location covered by the
projected triangle, a fragment is generated with interpolated attributes, such as
depth. Then the LDI generation algorithm is applied to the pixel.
Since stage 2 of the LDI generation algorithm is processed in parallel with the help of
the CUDA framework, the overall performance of LDI generation can be improved.
For the LDI generation kernel, the diagram below gives a detailed depiction:
Figure 4 Detailed diagram of the LDI generation kernel
4. Algorithm Detail
4.1. Triangle Mapping
Because the layered depth image is a 2D plane, all points in 3D space must be
projected onto that plane. Since the surface of an object in 3D space is composed of
thousands of triangles, all the triangles should be projected onto the 2D plane. In most
cases, the 2D plane is the screen.
Figure 5 Triangle Mapping
As shown in the demonstration above, all the vertices of the object are projected onto
the plane. In our case, an orthographic projection is applied.
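A minimal sketch of this projection (the origin and scale parameters that map the volume onto the LDI resolution are our assumptions, not identifiers from the report):

```cpp
#include <cassert>

struct Vec3  { float x, y, z; };
struct Pixel { int x, y; };

// Orthographic projection sketch: with the view direction along -z,
// screen coordinates are simply x and y shifted and scaled into pixel
// space; z is kept unchanged as the fragment's depth value.
Pixel projectOrtho(const Vec3& v, const Vec3& origin, float scale) {
    return { static_cast<int>((v.x - origin.x) * scale),
             static_cast<int>((v.y - origin.y) * scale) };
}
```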
4.2. Triangle Rasterisation
Because the vertex coordinates are continuous floating-point numbers, while the data
unit for the LDI is the pixel, all the triangles projected onto the 2D plane must be
rasterized into pixels for further processing.
Figure 6 Triangle Rasterisation
The scan-line algorithm was adopted to implement triangle rasterisation. All
scan-lines are horizontal lines. For every scan-line with a given y value, we check
whether it intersects any edge of the triangle. If so, linear interpolation is applied to
calculate the x values of the left endpoint x1 and the right endpoint x2. Then, starting
from x1 and stepping towards x2 by increasing the x value one pixel at a time, we
check whether each pixel is inside the triangle.
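The procedure above can be sketched as follows (a CPU sketch; function and type names are ours). For each scan-line we intersect the triangle's edges, take the leftmost and rightmost hits as x1 and x2, and emit every pixel in between:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct P2 { float x, y; };

// Scan-line rasterization sketch: returns the (x, y) pixel coordinates
// covered by the triangle (a, b, c).
std::vector<std::pair<int, int>> rasterize(P2 a, P2 b, P2 c) {
    std::vector<std::pair<int, int>> pixels;
    int y0 = (int)std::ceil (std::min({a.y, b.y, c.y}));
    int y1 = (int)std::floor(std::max({a.y, b.y, c.y}));
    P2 v[3] = {a, b, c};
    for (int y = y0; y <= y1; ++y) {
        float xl = 1e30f, xr = -1e30f;
        for (int i = 0; i < 3; ++i) {
            P2 p = v[i], q = v[(i + 1) % 3];
            if ((p.y <= y) != (q.y <= y)) {        // edge crosses this scan-line
                float t = (y - p.y) / (q.y - p.y); // linear interpolation
                float x = p.x + t * (q.x - p.x);
                xl = std::min(xl, x);
                xr = std::max(xr, x);
            }
        }
        if (xl <= xr)
            for (int x = (int)std::ceil(xl); x <= (int)std::floor(xr); ++x)
                pixels.push_back({x, y});          // x1..x2 span for this y
    }
    return pixels;
}
```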
4.3. Bilinear Interpolation
For LDI processing, we need to know the depth, i.e. the z value, of an arbitrary pixel
inside the triangle. Bilinear interpolation was applied to reach this goal.
Figure 7 Bilinear Interpolation
In mathematics, bilinear interpolation is an extension of linear interpolation for
interpolating functions of two variables on a regular grid. The key idea is to perform
linear interpolation in one direction, and then again in the other direction.
Suppose that we want to find the z value at the point P = (x, y). It is assumed that we
know the z values at the four points Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2)
and Q22 = (x2, y2). We first do linear interpolation in the x-direction. This yields:

z(x, y1) = ((x2 - x) / (x2 - x1)) z(Q11) + ((x - x1) / (x2 - x1)) z(Q21)
z(x, y2) = ((x2 - x) / (x2 - x1)) z(Q12) + ((x - x1) / (x2 - x1)) z(Q22)

We proceed by interpolating in the y-direction:

z(x, y) = ((y2 - y) / (y2 - y1)) z(x, y1) + ((y - y1) / (y2 - y1)) z(x, y2)

This gives us the desired estimate of z(x, y).
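As a sketch, the two interpolation steps above translate directly into code (parameter names follow the notation above):

```cpp
#include <cassert>
#include <cmath>

// Bilinear interpolation of z(x, y) on the rectangle with corners
// Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2),
// whose z values are z11, z21, z12 and z22 respectively.
float bilinear(float x, float y,
               float x1, float x2, float y1, float y2,
               float z11, float z21, float z12, float z22) {
    // Step 1: interpolate along x at both y levels.
    float zxy1 = ((x2 - x) * z11 + (x - x1) * z21) / (x2 - x1);
    float zxy2 = ((x2 - x) * z12 + (x - x1) * z22) / (x2 - x1);
    // Step 2: interpolate the two intermediate values along y.
    return ((y2 - y) * zxy1 + (y - y1) * zxy2) / (y2 - y1);
}
```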
5. Memory Hierarchy
5.1. Placement of vertices information
For one object, the triangles on the surface are placed in contiguous memory, as
demonstrated below:

Triangle 1 | Triangle 1 | Triangle 1 | Triangle 2 | Triangle 2 | …
Vertex 1   | Vertex 2   | Vertex 3   | Vertex 1   | Vertex 2   | …

Each vertex has three coordinates (x, y, z):

Vertex 1 x coord. | Vertex 1 y coord. | Vertex 1 z coord. | Vertex 2 x coord. | …
Each CUDA thread function processes one triangle, fetching its three vertex indices
and then the vertex coordinates. Part of the pseudocode is listed below:

For each thread:
    // three vertex indices per triangle; the leading offset of 1 skips
    // the first element of TriangleIndexMat
    VertexIndex1 = TriangleIndexMat[1 + threadId * 3];
    VertexIndex2 = TriangleIndexMat[1 + threadId * 3 + 1];
    VertexIndex3 = TriangleIndexMat[1 + threadId * 3 + 2];
    // each vertex stores x, y, z contiguously
    Vertex1x = VertexMat[VertexIndex1 * 3];
    Vertex1y = VertexMat[VertexIndex1 * 3 + 1];
    Vertex1z = VertexMat[VertexIndex1 * 3 + 2];
The z depth value and face direction are resolved in the LDI kernel function. The
fragment information calculated by the CUDA LDI thread function is stored in global
memory as a large 2D array (ldiMat) with the following layout: all the fragments of
one pixel are placed in a one-dimensional array whose length is the predefined
maximum depth complexity of the 3D scene. The first element of the array is a counter
of how many fragments are in the pixel, and the following elements hold the z depth
value and direction of each fragment. The detailed layout is demonstrated below:
Figure 8 Memory Layout of LDI generated data
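A single-threaded sketch of this layout (MAX_FRAGS and the packing details are illustrative assumptions; on the GPU the counter increment would use an atomic operation):

```cpp
#include <cassert>
#include <vector>

// Per-pixel layout sketch: slot 0 is the fragment counter, followed by
// (depth, direction) pairs up to a predefined maximum depth complexity.
const int MAX_FRAGS = 8;
const int STRIDE    = 1 + 2 * MAX_FRAGS;   // counter + (depth, dir) pairs

// Append one fragment to a pixel's array inside the flat ldiMat buffer.
// On the GPU, reading and bumping the counter would be one atomicAdd.
void storeFragment(std::vector<float>& ldiMat, int pixel,
                   float depth, float direction) {
    float* cell = &ldiMat[pixel * STRIDE];
    int count = static_cast<int>(cell[0]);
    cell[0] = static_cast<float>(count + 1);
    cell[1 + 2 * count]     = depth;
    cell[1 + 2 * count + 1] = direction;   // +1 front face, -1 back face
}
```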
6. Result Evaluation
To evaluate the correctness of our algorithm, we compare the result with the
depth-peeling-based LDI generation software from NVIDIA.
We compare the processing time and frame rate with the paper of F. Liu et al. to
evaluate the efficiency of our algorithm.
The results from the paper of F. Liu et al. are listed below:
Model     Dragon    Buddha    Lucy      Statue    Bunny
Tri. No.  871K      1.0M      2.0M      10.0M     70K
CUDP 1    321 fps   287 fps   153 fps   59 fps    459 fps
CUDP 2    363 fps   309 fps   165 fps   61 fps    573 fps
SRAB      116f/2g   106f/2g   49f/2g    8f/2g     334f/2g
DP        29f/13g   26f/13g   14f/13g   2f/15g    128f/13g
References
[1] F. Liu, M. Huang, X. Liu, E. Wu. Single Pass Depth Peeling via CUDA Rasterizer. SIGGRAPH
[2] C. Everitt. Interactive Order-Independent Transparency. Tech. rep., NVIDIA Corporation, 2001.
[3] B. Heidelberger, M. Teschner, M. Gross. Detection of Collisions and Self-collisions Using
Image-space Techniques. Journal of WSCG, Vol. 12, No. 1-3, WSCG 2004, Feb 2-6, 2004, Plzen,
[4] F. Faure, S. Barbier, J. Allard, F. Falipou. Image-based Collision Detection and Response between
Arbitrary Volume Objects. Eurographics / ACM SIGGRAPH Symposium on Computer