School of Computing and Creative Technologies
M.Sc. Computer Games Technology
Artificial Neural Network Implementation
Using
Genetic Algorithms on GPU using CUDA
Emre Caglar
0804329
University of Abertay Dundee
December 2009
1. Introduction
1.1 Artificial Neural Networks and Genetic Algorithms on GPU
It is one of the well known facts that, in recent years, artificial intelligence for games has
been more and more important. Considering games that have been produced in recent
years, it is clear that those games look incredible in terms of reality. However, not only is
realistic graphics make the game perfect, but also game developers have to support the
games with so-called intelligence. So we can call a game “so realistic”, if we can succeed to
combine both realistic graphics and artificial intelligence at the same time. The racing game
Colin McRae's Rally 2 is one of the sample that ANN is used to controlled opponent cars in
game.`
Artificial neural networks are one of the most effective AI techniques which are capable of
classification, approximation, learning, decision making and controlling. As a controller and
decision maker, artificial neural networks are widely used in games and its success has been
proved. The biggest constraint with neural networks is the essentially of training process.
Training data is one of the most important factors in terms of weight optimisation process of
neural network. It is relatively expensive operation especially if the training data is large. The
most common technique used to train ANN is the “Back-propagation” algorithm. Basically, it
feeds the neural network with training data and calculates the output. Then, it compares the
output and desired output and calculates the error rate. Finally, it back propagates the error
rate starting from the output layer to input layer and adjust the weights according to error
rate. Despite the fact that Back-Propagation algorithm is the most widely used training
algorithm, it has several disadvantages such that, it can stuck in local minima points, the
transfer function is supposed to be differentiable and it is difficult to parallelize. Moreover, it
is also so time consuming to back propagate error rate especially if the network has lots of
hidden layers and nodes.
Alike artificial neural networks, genetic algorithms are also based on biologic structure of
human and genetic algorithms are also so powerful to solve complex problems in a short and
efficient way. Genetic algorithms have been used to optimise solution space. Doing this, it
uses same principles with biologic genetic process. Firstly, some possible solutions related to
problem are coded in Chromosomes. The individuals carrying these chromosomes compose
the population. Population size and chromosome lengths are predefined values specific to
problem. After initialising initial population, a couple of genetic operators are applied to
chromosomes such as mutation and crossover. The aim is to provide divergency for solution
space. Mutation is made simple changing some bits or values of the chromosomes randomly.
As a result of this process, a new individual is generated. Another operator in terms of
genetic programming is crossover. The most common technique for crossover is to select
two parents and by changing some parts of genes and creating a new individual as a result.
Although crossover and mutation operators are the most common operators used, there is
lots of different implementation techniques for those operators. After mutation and
crossover, we calculate the fitness values of the individuals. The aim is to select individuals
which adapt the environment best. The fitness function is so important for evolution process
since it decided which individuals (solutions) are the best to solve the problem in question.
There are several selection techniques as well. One of them is selection %50 of whole
population whose fitness values are closer to 1 if we consider 1 is the perfect solution.
Another selection technique is that, we choose two random individuals and we select the
one whose fitness value is bigger than other and we eliminate the weak one. This solution is
also provide a opportunity that some weak individuals can also transfer their genes to next
population. It provides us better divergence. However, the method used in this project is
none of them :)
As i mentioned earlier, genetic algorithms are perfect for optimising. So why not to use GAs
to optimise weight values in a neural network. Although back-propagation is the most used
techniques to train neural networks, genetic algorithms became more and more popular to
optimise weights in recent years and some successful results obtained. It has some
advantages as well. Firstly, we do not have to use a differentiable transfer function in ANN
(with back-propagation, while updating weights, we are using the differential of transfer
function).Moreover, GA base solution does not stuck in local minima since it tries to find the
optimum solution in a large solution space using hundreds of chromosome. Lastly, easy to
implement as parallel which we are interested in this project.
Although, even they produce efficient solutions when ANN and GAs are used together, it is
computationally expensive process to train network. Especially for large training data sets
and large populations, it is extremely time consuming operation to train network since each
individuals test the each training values in data set file and produce a error value for each of
training set and then average the error value. Suppose we have 10.000 training data and for
each of them, training process will load the inputs to NN, will get the output value of NN and
will compare it with desired output. Then a fitness value will be calculated depending on
error value. If we have a population with 1000 individual, you can estimate how expensive
the training process is. That is why we have to parallelize it to get efficient results in an
acceptable time interval.
GPUs are very powerful parallel processors nowadays and they offer a huge computation
power and bandwidth. Their architecture is quite different compared to CPU as CPUs are
specialised to execute code in a serial manner, so the architecture is appropriate data
caching and flow control. For that reason, they have just a couple processors unlike GPU.
Unlike CPUs, GPUs are designed as SIMD processors in order to process a huge amount of
data simultaneously. GPU units have started to be used for general purpose computing since
a couple of years. However, programmers had lots of difficulties while implementing code
for GPU as they had to know very strong computer graphics background besides some low
level shader languages such as HLSL or Cg. Moreover, GPUs are not suitable to solve every
kind of problem so; programmer had to port their problem to GPU considering its
architecture and constraints. Beside all these problems, the biggest problem of GPU
programming was the lack of accessing the memory (read/write) randomly. It was possible
to read memory locations; however, it was not possible to write data to memory locations
belongs to graphic cards. So, programmer needed to make some trick in order to save their
data like using textures as memory locations for their data.
All these were problem until the time NVIDIA announced CUDA technology. CUDA abstracts
all the underlying problems defined previously, and offers a general computing framework
for GPUs. Thanks to CUDA, programmers do not have to know shader languages and strong
computer graphics background. Besides, programmers don’t need to know 3D graphics APIs
as well. Moreover, programmers have access to memory for both reading and writing. But
the best thing about CUDA is that programmer can write programs that will run on GPU with
a language look alike C. With CUDA, thousands of threads can run simultaneously. Threads
are grouped in blocks and blocks compose the grid. In order to make mapping algorithms to
GPU easier, threads in blocks can be organised as 1D, 2D or 3D. Like blocks, grid can be
organised 1D or 2D for easy access. These are all specific parameters for specific problems.
Programmer should choose the most efficient parameters for the problem.
1.1 Dissertation Structure
xxxxxxxxxxxxxxxx
2. Literature Review
2.1 Artificial Neural Networks
Artificial Neural Networks (ANN) have their origin in the attempt to simulate by
mathematical means an idealised form of the elementary processing units in the brain and of
their interconnections, signal processing, and self-organisation capabilities. (Soft Computing,
A Tettamanzi, M. Tomassini, p:51)
Artificial Neural Network (ANN) is simple a pattern recognition approach. It is a kind of
model of biological neural network in our brain and tries to simulate processing and learning
approach of brain. So basically, an ANN consists of neurons and each neuron connected
other neuron groups. The connection between those neurons is extremely important
because the stronger the connection is, the more active neuron is to learn and process
ability. A simple brain neuron structure is shown in Figure 1 below.
So how can we model a biological neuron in order to create a artificial network on
computer? We can represent a neural network as groups of artificial neurons grouped in
layers. First layer of ANN represent the input layer and last layer represent the output layer.
Although there are several different kinds of neural network topologies such as recurrent
networks and self-organizing networks, only feed-forward neural networks will be discussed
in this section as it is used in this project.
Between input and output layer, there can be exist several number of hidden layers. Each
neuron is connected to all the neurons in next layer. Beside these, every neuron except input
layer has an additional input called bias. We will focus on the multilayer, feed-forward neural
network for the implementation. This type of network is quite versatile and is capable of
handling a wide variety of problems. So a sample ANN structure is shown below,
Level 1 Level 2 Level 3
B
0
Input 0 0 B
1 B
0
Output 0 1 B
Input 1
2
B
1
Output
1 2 B
Input 2
3 B
Output
2 2
Input 3 B
3
4
B
5
We inject our input to the ANN by using input layer. (Layer 1 in above figure) Choosing
appropriate inputs for ANN is extremely important because a large scale of input data may
cause unwanted results or so narrow input set might not be represent correct data set in
terms of our problem. Input data format is also important. We cannot use “char” type or
“Boolean” type as an input type since ANN can process only real numbers. So somehow we
have to represent our pattern as a number sequence related to our problem. Moreover, it
has lots of advantages if we use normalized input values for ANN like between 0.0-1.0 or 1.0-
100.0.
Another question is the inner structure and how ANN works.
Input 0 bias
W
0
Bias weight
Input 1 W
1 ∑
Activation
δ
Input 2
W
2
Transfer function activation function
Input 3
W
3
Suppose that we have picked a random neuron from the network and there are some
input values from other neurons and weights between the current neuron and other
neurons. In addition to inputs, each neuron has a bias input and associated weight. What
actually bias value does is that it shifts the sum of the weighted input along horizontal x axis.
It provides us a kind of flexibility that we can change the threshold using bias value. Most
probably bias value is chosen 1 or -1. So the total input to a neuron is,
N
Neuron x = ∑ inputi x wi + (bias x bias weight)
i=0
Then we use the output of this formula to calculate the out of the neuron by using a
activation function. An activation function simply gets the net output of the neuron,
operates on it and calculates the output of the neuron. The most commonly used activation
function is sigmoid function or logistic function. We will choose sigmoid function for the rest
of the examples. The formula of the sigmoid function is, (AI for game developers)
F(x) = 1 / (1 + e-x)
As you can see the output of the sigmoid function is always between the interval of 0.0
and 1.0 actually, the results never reach 0 and 1 values. At this point we have to define a
threshold value that decides to activate or deactivate the neuron. Suppose we have chosen
“0.5” as our threshold value. Then when sigmoid function produces a value bigger than 0.5
activates the neuron, otherwise, the neuron is deactivated. Moreover, you can say that if
sigmoid function produces 0.9 we activate the neuron and if it produces 0.1 we deactivate
the neuron. What I mean is that, we can change the active-deactivate behaviour by adjusting
the threshold value.
Actually, in order to get meaningful result from our ANN we have to train it. It can be
said that it is also the most difficult part of implementing a neural network. There are also
different kinds of training algorithms for neural networks. They can be categorized under the
name of supervised and unsupervised training. However, the most commonly used training
algorithm is back-propagation algorithm so I would like to explain the method briefly.
Training process of an ANN actually is an optimization process. To start training process,
we set random weights to neurons. After this initialization, we apply our training input to the
network and calculate the output generated by neuron. The next step is comparing the
generated output and the real output we want to obtain. The difference between those two
values provides us to calculate an error value. The most commonly used error evolution
formula is mean-square error. We can calculate that value with below formula, (AI for game
developers)
ε = ∑ (nc - nd)2 / m
While nc is the calculated value and nd is the desired value and m is the number of output
neurons for each epoch.
According to error value, we adjust the neuron weights to be able to obtain wanted
output. This adjusting process can depend on different techniques or formulas. We will use
the below formula to adjust weights
∆w = ρ δi ni
Ρ is the learning rate, δi is the error value we obtained and ni is the value of neuron being
considered. Then we can change the weight value simply adding this value to the old weight
value.
Although, it is a well know method and widely used algorithm by several applications, Bask-
propagation algorithm has a couple of drawbacks such as local and global minima, learning
topologies and generalization and over fitting. (Soft Computing, Integrating Evolutionary,
Neural, and Fuzzy systems, Chapter:2 p: 63,64,65)
Local & Global Minima
Because of back-propagation is a gradient-descent search algorithm, it finds the closest local
minimum with respect to the starting point of the search. (. (Soft Computing, Integrating
Evolutionary, Neural, and Fuzzy systems, Chapter:2 p:,64) So, considering search space is
multi-dimensional and has lots of local minimum point, most probably back-propagation
algorithm will stuck in a local minimum point as shown below.
Although there are some techniques to come over this problem like “adding momentum”,
(AI for game developers, chapter 14 p: 288) there is not any perfect solution for this problem.
Learning Topologies
Back propagation or similar algorithms adjust weights for fixed feed forward networks.
However, it is clear that the interconnection of the units and their number plays an
important role and lack of knowledge in determining the appropriate topology for a
problem, including the number of layers; the number of neurons per layer often results in
slow learning speed and poor performance. (Soft Computing, Integrating Evolutionary,
Neural, and Fuzzy systems, Chapter:2 p:,64-65) )
Generalization and over fitting
If our network adapts to training data so well, the learning error rate will be too small,
however, the test error rate will be high as it won’t be able to approximate the given
function to produce desired output. In other words, network starts to learn input and
outputs instead of approximate them. (Soft Computing, Integrating Evolutionary, Neural, and
Fuzzy systems, Chapter:2 p:,65-66) )
Due to the problems discussed above, evolutionary algorithms are started to be used in
order to come over these problems and successful results have been obtained. Genetic
algorithms and genetic algorithms based network training & weights optimization will be
discussed in following chapter.
So how good are multi-layered feed-forward networks? In order to answer this question, we
have to figure out to which parameters it depends.
The learning algorithm and the number of iterations
The number of learning samples in order to decide how good the training data represent the
actual function
The number of hidden units
The number of learning sample is one of the important parameter to measure how good the
network is. (Instruction to neural networks, Ben Krose, Patrick van der smagt, November
1996) Krose and Smagt use a sample function to testing the neural network. Y= f(x).They use
two different number of learning sample to compare result. First, they use 4 learning sample
and then they use 20 learning sample. With 4 learning sample, the error Elearning is small,
however Etest is large. Since the learning samples are not enough to approximate the
function to the desired output value. With 20 learning sample, the test error rate decreases
while leaning error increases. The graphics shown below show this two different cases and
how good the neural network approximates the function.
The dashed line gives the desired function, the learning samples are depicted as circles and
the approximation by the network is shown by the drawn line. A) 4 learning samples B) 20
learning samples.
For another test, the same function is used but different hidden unit numbers have been
used. Firstly, network has been tested with 5 hidden units then tested with 20 hidden units.
With 20 hidden units, the effect called overtraining can be seen easily from the results. The
network fits exactly with the learning samples, but because of the large number of hidden
units, the function which is actually represented by the network is far wilder than the
original one.
The dashed line gives the desired function, the circles donate the learning samples and the
drawn line gives the approximation by the function.12 learning sample is used. A) 5 hidden
units B) 20 hidden units
“The example shows that a large number of hidden units leads to small error on the training
set but not necessarily leads to a small error on the test set. Adding hidden units will always
lead to a reduction of the learning error rate. However, adding hidden units will first lead to
a reduction of the test error rate but then lead to an increase of test error rate as shown
below.” (Instruction to neural networks, Ben Krose, Patrick van der smagt, november 1996)
2.2 Genetic Algorithms
Genetic algorithms can be used for any problem where an optimisation process takes place.
Possible solutions are coded as individuals and they evolve over time by adapting themselves
to environment and the ones which do not fit are eliminated.
An evolutionary algorithm maintains a population of candidate solutions for the problem at
hand, and makes it evolve by iteratively applying (usually quite small) set of stochastic
operators, known as mutation, recombination (crossover) and selection. (Soft Computing,
Integrating Evolutionary, Neural, and Fuzzy systems, Chapter:1 p:,2)
A sample evolution process is shown below,
First Generation
Rank Fitness
Selection
Evolution
Taken from AI for game developers
Representation
Since I have intended to use genetic algorithms to optimize just weights in a neural network,
I will not mention about how to encode the ANN architecture and topology into a
chromosome. So, I will discuss about encoding strategies of weights briefly.
For weight based representation, the first basic decision should be made about which
method will be used between bit strings representation and real value representation.
One of the approaches for representing the connection weights is using bit strings. In this
approach each weight value would be represented as a binary string and then decoded into
real values between, say, -10, 10 or -1, 1. This type of representation has some drawbacks.
When the number of connections increase, the length of the binary strings increase to a size
which causes to slower evolution process.
(Evolutionary_algorithms_for_neural_network_design_and_training.pdf, Soft Computing,
Integrating Evolutionary, Neural, and Fuzzy systems, Chapter: 4 section: 4-2 p: 125)
4 ...
I0 5
I1 -2
6 ...
0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0
|---------------4--------------|--------------6------------------|-----------------5-----------------|-------------- -2
-------------|
In this sample, each weight value is represented with 8 bit and first bit is used as a sign bit.
So with 7 bit, we can encode the weights between -127, 127. It is suggested in (Combining
Genetic Algorithms and Neural Networks the Encoding Problem) that, one extra bit can be used to
encode if the connection is exists at all or not.
Another approach to encode weights is to use real value representation. Instead of bit
strings, weights are coded directly as real values. Most of the current genetic algorithm
weight optimization methods use real value representation.
(Evolutionary_algorithms_for_neural_network_design_and_training.pdf)
For same network illustrated above, the real value representation is shown below;
4.2 ...
I0 5.4
I1 -2.0
6.1 ...
4.2 6.1 5.4 -2.0
Real values can be implemented as integers or floats and they are usually kept in a certain
interval. When initialising the first population and assigning random values to chromosomes,
usually a pre-defined interval is chosen and numbers are generated randomly in that
interval. Usually, weight values are assigned between the interval of -10, 10 or -20, 20.
(Neural network weight training by mutation, USING A GENETIC ALGORITHM FOR
OPTIMIZATION OF SYNAPTIC WEIGHTS OF NEURAL NETWORKS)
The other important answer that is supposed to be answered is in which order weights are
supposed to code into the chromosome. If you consider crossover operator (which will be
discussed in following sections in detail), it is more likely that the genes far from each other
are disrupted compared to close ones. See the following figure,
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9 Gene10
Cut point
As you can see it is more likely that gene2 and gene9 will be disrupted compared to gene2
and gene3. Because of this feature, it is recommended that to organize genes tightly which
are functionally close. In other words, it is recommended to organize all the incoming and
outgoing weights of a hidden node next to each other.
(Evolutionary_algorithms_for_neural_network_design_and_training.pdf)
Selection of the encoding model strongly depends on the problem we are trying to optimize.
However, we should keep in mind that both binary strings and real value encoding have
some drawbacks. One of the drawback in terms of bit strings is that if the network structure
gets more complex, then chromosome lengths will be too much which result the slow
evolution process as mentioned above. However, the drawbacks of real value code are
utmost important.
One of the problems with real value implementation is that mutation is the only way to
change weight values. What I mean is that, changing the values between two chromosomes
only results another weight combination of parent chromosomes. However, considering bit
string representation, changing bits between chromosomes results a different weights
combination. So, this is one of the weaknesses of real time implementation. (Training of
Neural Networks by means of Genetic Algorithms Working on very long Chromosomes)
//CAN BE EXPRESSED THE REDUNCANSIES in this section as well
Mutation
Mutation operator simply changes the value of some genes in a chromosome randomly. For
bit string encoded chromosomes, mutation accurse by flipping some random bits. Mutation
rate in evolution process decides the probability of mutation in a population. Suppose if the
mutation rate is 2%, it means two individuals in a population with size 100 will be mutated.
See the diagram below to see the effect of mutation for a bit string represented
chromosome.
0 0 1 0 1
0 1 1 0 0
Chromosome before mutation: 00101
Chromosome after mutation: 01100
It is bit different when real value representation is used. A mutation operator adds or
subtracts an integer and shifts the weight value in a small amount. The value that will be
added or subtracted can be chosen in an interval (1...α) where α is the amount of mutation.
Crossover
Crossover operator creates a new individual by selecting genes from parents randomly.
There are different kind of cross over implementation but all of them rely same principles.
First check how crossover operator process on bit string based chromosomes.
Parent 1 Parent2
0 0 1 1 0 1 1 0 0 0
0 0 1 0 0
Above figure shows one-point cut crossover. There exists two or three points cross over
operators as well. By cross over operator, parents change their genetic materials and they
create a new individual as a result.
Fitness Function
Selecting fitness function is one of the most important decision on Genetic Algorithm since it
decide how well the solution for the specified problem. Besides, it id decided whether a
chromosome will transfer its genetic material to next population or not depending on the
chromosome`s fitness value.
Selection
Briefly, selection means the fittest individual survives. Selection is made based on fitness
values of the chromosomes. The chromosome that has the bigger fitness value compared to
others is more likely that it will survive in next generation as well. A couple of different
selection methods exist such as tournament-based selection or roulette-wheel selection.
Tournament based selection simply selects two random individuals from population and it
saves the one with bigger fitness value and discards other one.
Roulette-wheel selection xxx
If we sum up all the steps, we can summarise the process with fallowing flowchart
(http://files.myopera.com/blu3c4t/blog/a_basic_genetic_algorithm.jpg)
So, lastly, what are the termination criteria while evolving the population? Both best fitness
value depending on the error rate and generations number is important when termination
criteria is concerned. Sometime reaching a certain error value cab terminates the progress or
reaching the maximum generation number as well. Moreover, sometimes both of them can
be used. It is totally up to problem domain.
Evolutionary Algorithms on ANN Training
Evolutionary algorithms have being used for years and successful results have being
obtained. (xxx)
The first step is to create initial population. Each chromosome composes of weight as much
as connection number in ANN. Each chromosome represents a possible solution set of
weights for ANN.
When training the neural network with the help of genetic algorithms, fitness values are
calculated according to error rate of ANN for a specific chromosome. The pseudo code has
shown below;
for each chromosome in Population
do
Load weights to ANN
For each data set in training samples
do
Load inputs to ANN
Feed forward the input
Calculate output
Compare actual and desired output
Cumulate the error value
end
end
After processing all the learning data samples, an error value is obtained for the
chromosome. We can use this error value as a fitness value of the chromosome. So, the
smaller fitness value is, the more successful the chromosome is to solve that problem
2.3 Differential Evolution Method
Differential evolution is another kind of evolutionary programming method with some
differences in mutation and crossover operator.
In differential evolution method, new individuals are generated by the combination of other
chromosomes from the population. As mentioned in previous sections, mutation operator is
applied to chromosomes simply changing some random bits or adding/subtracting some
values. However, in differential evolution, two chromosome vectors are selected randomly;
their difference is calculated, after multiplying the difference with a mutation constant, we
add it to another randomly selected chromosome.
Let`s say w represent the chromosomes, g for generation and k for indexing the individual in
population. So, let`s say mutant vector can be generated according to following rules;
= +µ( – )
= +µ( – )
= +µ( – )+µ( – )
Where is the best individual of the gth generation and r1, r2 and r3 is mutually
different and different from the running index i. As you can see, differential evolution
method is in a form of combining other individuals and generating a trial chromosome as a
result. µ > 0 is a real parameter called mutation constant, which controls the amplification of
the difference between two individuals.
After generating mutated individual, crossover (or recombination) operator is applied to the
mutated individual. To decide which gene will be transferred to new individual, a random
number is generated between [0, 1]. If random number is less than a crossover rate, the
gene transferred from trial mutant vector, otherwise from target vector. This processes is
illustrated below,
Population size = 10
Suppose for chromosome 2,
Ch1 Ch2 Ch3 Ch4 Ch5 Ch6 Ch7 Ch8 Ch9 Ch10
+µ( - )
Trial vector
Random
[0, 1]
Result vector
Figure DE evolution
After generating result vector, select it if its fitness value is less then original one. Otherwise,
keep the original one.
2.4 CUDA
INTRODUCTION
CUDA (Compute Unified Device Architecture) is a both hardware and software model which
allows us to run parallel implemented code on GPU without the need of mapping to any
graphics API. CUDA scales the programs to 100`s of cores and 1000`s of threads running
simultaneously. CUDA allows programmer to focus on parallel algorithm instead of the
mechanism of a parallel programming language or graphics APIs.
The program part which runs on GPU simultaneously is called “Kernel” in CUDA literature.
One kernel is executed on GPU at a time and many threads (1000s) execute each kernel.
There are important differences between CPU threads and GPU threads.
CUDA threads are extremely lightweight which means very little creation overhead
and instant switching.
CUDA uses 1000s of threads to run efficiently. However, CPU can use only a few
In CUDA literature, CPU is called as “Host” and GPU is called as “Device”.
THREAD HIERARCHY
CUDA also provides a thread hierarchy in order to make kernel mapping to GPU easier.
Thread hierarchy is also defines which thread can reach which memory locations and also it
is used for thread synchronization.
Each thread runs the same kernel issued by host and has its own local memory and registers.
Local Memory
Registers
Threads are batched in thread blocks. Each block can consist of up to 512 threads and
threads in the block can be organized as 1D, 2D or 3D. Each thread inside the same thread
block has access to shared memory and can cooperate via shared memory.
Shared Memory
Thread Block
Threads in each block can be synchronized however; threads in different blocks can be
synchronized. Moreover, threads in same block can cooperate via shared memory.
Each thread has given a unique thread id when the kernel is lunched. This thread id variable
is a built-in variable and can be assessed by x, y and z components as threads can be
organized as 1D, 2D or 3D.
Thread Block
(0, 0) (1, 0) (2, 0) (3, 0)
(0, 1) (1, 1) (2, 1) (3, 1)
(0, 2) (1, 2) (2, 2) (3, 2)
(0, 3) (1, 3) (2, 3) (3, 3)
As you can see, threads in a 2D block is identified with x and y components and can be
accessed via built-in variable threadIdx
Thread blocks are required to execute independently since they can be executed in any
order, in parallel or in serial. This independence requirement allows thread blocks to be
scheduled in any order across any number of cores.
Thread blocks compose “grid” and grids can be composed of 1D or 2D thread blocks as
shown below;
Shared Memory Shared Memory Shared Memory Shared Memory Shared Memory
Each block is given a unique block id and can be accessed via blockIdx built-in variable. A grid
can consist of 232 blocks resulting in a total of 241 threads.
Block size also can be reached via blockDim variable. It is a 3D variable and to get the size of
each dimension, we can use x, y or z component of the variable.
There is no way to synchronise thread blocks. Only threads in same block can be synchronized. And
different blocks cannot use each other`s shared memory. Each thread has an access to global memory
at grid level.
Global Memory
Kernel1 >
Kernel 2>
Software model of CUDA highly coupled with hardware model. So, in order to understand
software model like thread hierarchy or memory hierarchy, it is better to have a look to
hardware model of CUDA.
ARTHITECTURE AND MEMORY HIERARCHY
Let us see the hardware architecture of NVIDIA G80 series GPU which is capable of running
CUDA as well.
Multi
Processor
TP TP
TP TP
TP TP
TP TP
Shared
Memory
TP = Thread Processor
As you can see G80 series GPUs has 16 Multi-processors which means 16x8 = 128 thread
processor. Each multiprocessor has 8 thread processor and each thread processor in same
multi=processor can access share memory of the multi-processor.
A grid is executed on the device by scheduling thread blocks onto the multiprocessors.
Thereby, each block is mapped to one multiprocessor. Multiple thread blocks can be mapped
onto the same multiprocessor and are executed concurrently. If multiple blocks are mapped
onto a multiprocessor, its recourses, such as register and shared memory, are split among
the mapped thread blocks. This limits the amount of thread blocks that can be mapped onto
the same multiprocessor. It is called occupancy the ratio of active threads on a
multiprocessor to the maximum number of threads supported by a multiprocessor.
(Cuda_data_structure_nice.pdf20/78)
CUDA devices use several memory spaces which has its own characteristics that reflect their
usage in CUDA applications. These memory spaces include global, constant, shared, texture,
local and register as shown below,
HOST DEVICE
DRAM GPU
CPU Multi Processor
Local
Memory
Registers
Shared Memory
Global
Memory
DRAM Chipset Texture Multi Processor
Multi Processor
Constant
As you can see, host can read and write to Global memory, texture memory and constant
memory, however, constant memory and texture memory are read only memory spaces for
GPU. In the scope of this project, only global memory is used to transfer data to GPU.
As you can see, global, texture and local memory spaces reside in device e DRAM location,
they have the greatest access latency (although texture is cached), followed by constant
memory, registers and shared memory.
The various principles traits of the memory types are shown below,
Memory Location Cached Access Scope Lifetime
on/off chip
Register On n/a R/W 1 thread Thread
Local Off No R/W 1 thread Thread
Shared On n/a R/W All threads in block Block
Global Off No R/W All threads + host Host
allocation
Constant Off Yes R All threads + host Host
allocation
Texture Off Yes R All threads + host Host
allocation
SOFTWARE MODEL
CUDA programs can be written either using C for CUDA or CUDA Driver API. C for CUDA is
selected in this project since it has C style syntax and easier to learn.
C for CUDA extends C and allows programmer to define functions, called “kernels” , and
executed N times in parallel by N different CUDA threads unlike traditional C functions.
A kernel is defined with a __global__ declaration specifier which means it will be called by
host. Kernel calls from host code is done by kernel name and parameters followed by >>
syntax to provide some attributes such as block size, grid size, shared memory size per block
etc... Beside kernel functions have to be “void” return type. A sample declaration and host
code showed below;
void __global__ vectorAdd ( float* vec1, float* vec2, float* result
){
//add operation
}
int main (){
//
vectorAdd>>(A, B, C);
}
It invokes a kernel call with just one block in grid and N thread in block.
As I mentioned in previous sections, each thread is given a unique thread id which can be
accessed via built-in threadIdx variable in device code. threadIdx is a dim3 variable type
which is a built-in type in CUDA and has x, y and z components. Just for illustration, we can
improve the code show above with threadIdx variable,
void __global__ vectorAdd ( float* vec1, float* vec2, float* result
){
//add operation
int thread_id = threadIdx.x;
result[thread_id] = vec1[threadId] + vec2[threadId];
}
int main (){
//
vectorAdd>>(A, B, C);
}
This kernel will be executed by several threads with different thread Ids. So that, each thread
will use thread id as an index and will access the variable using that index in arrays and will
perform the adding operation. See the diagram below in order to see how this adding
operation is made in parallel.
0 1 2 3 0 1 2 3 0 1 2 3
Thread id = 0 write
[Add]
Most of the kernel codes use the unique thread Ids in order to access data in parallel. So it is
important to identify each thread correctly. To do this, we use built-in variables like
blockDim, blockIdx and threadIdx. Device memory has a serial and one dimensional memory
layout physically like other memory spaces like main memory in host. So that, 2D and 3D
thread ids are all virtually and just make the mapping easier for programmer. That is why,
thread ids are supposed to converted to unique thread ids in grid. The most common pattern
to find the unique thread id is blockIdx.x * blockDim.x + threadIdx.x
Grid
0 1 2
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
blockIdx.x * blockDim.x + threadIdx.x blockDim.x = 5
Threads within the same block can cooperate among themselves by sharing data through
shared memory and synchronizing their execution to coordinate memory access.
Synchronizing among threads in same block is done via __syncthreads () function provided
by CUDA. This functions acts as a barrier at which all threads in the block must wait before
any is allowed to proceed.
Previously mentioned, host and device have separate memory spaces and data is transferred
from host to device via PCI bus. Host manages device memory, so CUDA provides some c like
functions to allocate, free and copy memory locations. The important point is that
programmer should keep in mind that pointers to memory locations are just numbers. That
is why; we cannot tell from the pointer value whether it shows a memory location on CPU or
GPU. So, dereferencing device pointer on host most probably crashes the program and same
for vice versa.
Functions provided by CUDA to manage device memory is easy to use as they look like C
memory operation functions. Mainly, there are three kinds of important memory managing
functions which are frequently used while developing a CUDA application. These are,
cudaMalloc(void** pointer, size_t nBytes);
cudaFree(void* pointer);
cudaMemcpy(void* dest, void* src, size_t nBytes, enum
cudaMemcpyKind direction);
enum cudaMemcpyKind can be any of the followings,
cudaMemcpyHostToDevice
cudaMemcpyDeviceYoHost
cudaMemcpyDeviceToDevice
As you can see, CUDA interface provides memory copy operations between host and device,
device to host and device to device. The point which should be kept in mind when memory
operation functions are being used is that, programmer should be careful about which
pointer is a device pointer and which one is a host pointer. In order to make the
programming easier, there can be used a specific notation when declaring a variable such
that “h_” prefix for host side variables and “d_” prefix for device side variables. Throughout
the thesis, this notation will be used to manage pointer easier and prevent the memory
crashes.
// Device code
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >>(d_A, d_B, d_C);
// Copy result from device memory to host memory
// h_C contains the result in host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
Above code demonstrates the basics of memory operations such as allocate, copy and free.
First of all, memory is allocated on host side by “malloc” functions then device memory is
allocated by “cudaMalloc” CUDA functions and after, vectors are copied form host to device.
Finally kernel is lunched and pointers are freed. These steps are the very basic and common
patterns for memory operations while writing a CUDA application.
Until now, we have talked about allocating variables and one dimensional array. What
happens if we need 2D or 3D arrays on device? In this case, CUDA recommends to the
developers to use cudaMallocPitch () and cudaMalloc3D (). These functions are
recommended for allocation of 2D or 3D arrays as it makes sure that the allocation is
appropriately padded to meet the alignment requirements. The returned pitch must be used
to access array elements. The following code sample allocates a width x height 2D array of
floating-point values and shows how to loop over the array elements in device code:
// Host code
float* devPtr;
int pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float),
height);
myKernel>>(devPtr, pitch);
// Device code
__global__ void myKernel(float* devPtr, int pitch)
{
for (int r = 0; r >>
grid : grid dimension up to 2D (dim3 type)
block: block dimension up to 3D (dim3 type)
stream: stream ID (optional)
shared_memory: number of bytes per block (optional)
Until now, we have used some defined built-in variables in CUDA. These built-in variables can
be accesses in __global__ and __device__ functions. These variables are listed below,
dim3 gridDim;
o Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
o Dimensions of the block in threads
dim3 blockIdx;
o Block index within the grid
dim3 threadIdx;
o Thread index within the block
Dim3 is a vector type extending int and can be accessed using x,y and z components.
There are also a couple of built-on variable qualifiers defined by CUDA framework which are;
__device__: If a variable is identified with this qualifier, it means the variable is
stored in global memory which is large and has high latency and no cache. These
kinds of variables are allocated via cudaMalloc described above and accessible by all
threads.
__shared__: These kinds of variables are stored in on-chip shared memory which has
very low latency. They are specified by execution configuration or at compile time
and accessible by all threads in same block.
Unqualified variables like scalars and built-in vector types are stored in registers and if they
do not fit in registers they are spills to local memory.
xxx
3. Methodology
Mainly, I have implemented the project in two parts. One of them is the serial
implementation of neural network trained by genetic algorithms on CPU and the other one is
on GPU using CUDA.
IMPLEMENTING NEURAL NETWORK ON CPU
The first design decision while implementing CPU code was to implement it as low level as I
can do. Because, there is no need to implement hundreds of classes and to encapsulate
everything if it is possible to solve the problem easier and with less coding. Another thing is
that, it is easier to port low level code to GPU then object oriented design and coding.
Another reason for doing both serial and parallel implementation is that, before
implementing parallel, to understand the basics of neural networks and genetic algorithms.
Another thing is that, in order to measure performance between parallel and serial
implementation and to show how fast the parallel implementation is. As I propose that,
parallel implementation on CUDA is much faster than serial implementation which means,
most of the AI workload on CPU can be shifted to GPU and AI in games can be run faster on
GPU allowing CPU to run other stuff.
First of all, ANN composes of nodes and its connections. It is clear that, nodes can be
represented as a 2D array and can be identified by its layer number and node number in that
layer. Similarly, connections can be represented as a 3D array and a connection can be
identified as the connection between the given neuron in given layer and the given neuron in
previous layer.
These arrays could be defined static arrays with a size. However, it would waste memory
spaces as all the layers have not the same number of neuron and moreover, layer numbers
and neuron numbers should be able to given from an external source so that those numbers
are not static. Considering all of these, the best solution is to implement nodes and
connections as pointers. That is why “**” (pointer-to-pointer) and “***” pointer-to-pointer-
pointer are used to define neural network. You can see the diagrams defined below to see
how neural network is represented in memory using pointers.
This diagram shows how a 3 layer (3-4-2) neural network can be implemented by pointers.
Neurons are simple 2D dynamic allocated arrays in memory as we can identify each neuron
using two indices. Layer number and neuron number in that layer.
*
**
X[0][0] = neuron a
x X[1][1] = neuron b
Pointer-to-pointer
Pointer array
(Indicates layers)
Double arrays
a
(Indicates neurons)
b
Weights are represented as 3D arrays as we can identify weights using three parameters.
Layer number, neuron in that layer and neuron in previous layer.
*** Pointer-to-pointer array ** (per layer)
NULL
Pointer array * per neuron
(for each pointer, allocate
space as much as the neuron
numbers in previous layer)
We can index a as
a [2][0][2] means
weights between
the first neuron in
layer 3(output layer)
and neuron 3 in
previous layer (layer
2)
So why needed lots of classes like Neuron class, layer class, neural network class? Neurons
are just 2D arrays and weights are just 3D arrays.
Dynamic memory allocations were done according the fallowing code snap.
//allocate memory for weights
double *** m_neuronWeights = new double **[m_numberOfLayers];
for(int i = 1 ; i > =
DatasetReader::getInstance()->ParseFile(“xx.dat”,
INPUT_NUMBER,
OUTPUT_NUMBER,
TYPE::DOUBLE);
With get instance, we obtain a data set reader reference, and then we parse the file. Parse
file function is a generic function and can be used with different types. Currently, double and
int data types are supported as float can be handled as double. So, from function call, we can
understand that we will read a file containing double values. First attribute of the function
shows which file we will read, second number is how many of them input values, third one
shows how many of them output values, and finally we use an enum (TYPE::DOUBLE or
TYPE::INTEGER) value to indicate to which type we will cast the read string from file as we
can understand the type from template parameter. That is why we have to indicate what
type we are reading second time.
Return type of the function is vector> DataPair is a structure
defined in DatasetReader header file which is composed of one input vector and one output
vector.
GENETIC TRAINING BASED ON DIFFERENTIAL EVOLUTION
The most important part of ANN implementation is the training process. For training, back-
propagation algorithm is used widely, however, due to the drawbacks that I have mentioned
in Literature review section; I have decided to use genetic algorithms to optimize ANN
weights. GAs is also more appropriate to implement in parallel fashion.
Firstly, I have created the structure of chromosomes. A chromosome is simply a double array
which is composed of neural network weights. For the representation of chromosome
composed of weights, please see the figure below,
4.2 2.3
I0 5.4
I1 -2.0
6.1 1.1
double* 4.2 6.1 5.4 -2.0 2.3 1.1
The first step is creating an initial population with random values. Maximum population
number that will be created defined in Constants header file mentioned previously. It is
suggested that, the population size should be 2X getMinFitness () > DESIRED_ERROR_RATE && g-
>mGenerationNumber cycle ();
}
As you can see desired error rate and max generation number values are constants and they
are defined in constants header file. Default values are 0.005 and 2000 for error rate and
max population number respectively. You can change the parameters in order to get
different results.
Cycle method is the main function where the evolution process is takes place. This methods
executes mutation, cross over and selection operators for each individuals. If new trial vector
decreases the error rate, it is accepted new individual, otherwise it is discarded. Please refer
to Figure DE Evolution.
All the functionality related to genetic training is encapsulated within “GeneticTraining”
class. This class also includes an ANN instance for training purpose. So all the parameters
related to genetic process and ANN should be passed to constructor. The constructor of
GeneticTraining class is shown below,
GeneticTraining( int numOfLayers ,
int* neuronsNum ,
int MaxPopulation,
float mutRate ,
float crossRate );
numOfLayers : Total number of layers of ANN
neuronsNum: Indicates the neuron numbers in each layer
MaxPopulation: Indicates max number of population
mutRate:mutation rate that is used during mutation process
crossRate:crossover rate that is used during crossover process
For all these parameters, their default values are defined in constants header file. If any of
them is not defined in constructor, it means their default values will be used.
Now we can describe the steps used during genetic algorithms based training in detail. First
step is initializing the population and calculating the fitness values of the initial population.
Calculation fitness function is one of the most expensive operations during training process
as it calculates every individual`s fitness functions and for each of them it process ANN for
each entry in training dataset file. So, if we have 1000 individual and our training data has
5000 training samples, it means for every fitness calculation cycle, 5000 x 1000 = 5.000.000
calculations will be done which is so expensive.
To store fitness values, a double array with the same size of population is used, so, if our
fitness array is;
Double* fitness_array = new double[MAX_POPULATION_NUMBER];
Then fitness_array[25] stores the 25.th individual`s fitness value. So, after initialising the first
population with random weights, firstly their fitness values are stored and found the best
fittest one.
void GeneticTraining::cycle (){
Chromosome m;
Chromosome c;
Chromosome s;
for(int i = 0 ; i =0 ; k--){
sum += neurons[j - index - (k + 1)] * chr [chromosomeIndex++];
}
//add bias
sum += chr[chromosomeIndex++];
//sigmoid
float calc = expf(-1 * sum);
calc += 1.0f;
calc = 1.0f / calc;
//store value
neurons[j] = calc;
}
}
//load output to output neurons
This is the all code which makes feed-forwarding on a ANN. As you can see, there are no
complex ANN data structures. We define just a neurons[NEURON_NUM] array to store
neuron values. For each layer, weight and neuron value multiplication is made for each
neuron and value is stored. Before progressing on layer, the number of neurons in previous
layers is calculated in order to index neuron value array. The conversion between 1D neuron
value array and real ANN topology is shown below,
Layer 0 Layer 1 Layer 2
Index = 0
Index = 0
Index = 0
N 2, 0
Index = 1
N1, 1
Index = 2
For N 1, 1 for N 2, 0
Previous neurons number + index previous neurons number + index
1+1=2 4+0
0 1 2 3 4
Neurons values array
So that we can represent all neuron values in a 1D array and we can map topology and array
to each other. For instance, for output neuron, loop executes three times as there are three
neurons in previous level. So their values are fetched from neurons array and multiplied by
the weights which are fetched from chromosome. After that, bias value is added (bias weight
is also stored in chromosome) and finally sigmoid function calculates the final output and
store it to neuron. We have to store neuron values somehow as all the neurons except input
layer need the values of previous neurons.
In this method, ANN is represented as just values of products. It allows as running separate
ANN instance for each thread.
Actually, we can improve this method by storing the neuron values for just previous layer
since if we are calculating 3. Hidden layer neuron values, we don’t need to know the neuron
values in first layer as we have already used those values. So we can just use a array with size
of the biggest layer and for each layer calculation, we could update that array instead of
storing all the neurons in ANN.
Another data structure is fitness array. It is simply 1D array with the size of population and
stores the fitness value of the chromosomes by one-to-one basis. It means, if we want to
learn the fitness value of 5th chromosome, we should fetch the value with index 5 Index in
fitness array.
Besides, same data reader class which is used in serial implementation is also used in CUDA
application.
3-Kernel Design
Two kernels are used within the applications which are to calculate the first fitness values
and to train ANN.
Firstly, let’s see the kernel responsible in calculating first fitness values of chromosomes.
Before that, chromosome array is created and initialized by random numbers between the
values -10, 10. Chromosome array consist of chromosomes which are simply the weight
values for the ANN. This initialization is made at host side and same number generator is
used with serial implementation
After, memory allocation is made at GPU memory for training data, initial chromosome
weights and other necessary parameters. After allocation process, all the values in question
are copied to memory locations which are allocation by “CudaMalloc”.
__global__ void calculateFirstFitness(float *individuals, ANNDATA
*trainData,int trainDatasize,float* fitnessVector)
The first parameter is chromosome array for weight values, train data is the structure for
training data, and fitness vector is the array into which weigh values will be stored. This
kernel is called from the host as;
calculateFirstFitness>>
(d_individuals,d_training_data,trainDataSize,d_fitnessArry);
For each chromosome, a thread is created so that, each thread picks a chromosome up to
calculate its fitness with following code snap.
int blockId = blockIdx.x;
int threadId = threadIdx.x;
int index = blockId * blockDim.x + threadId;
Chrosomome individual = &individuals[CONNECTION_NUM * index];
It iterates over the all training samples and calculates the error rate same as in serial
implementation. To feed forward the input data through ANN and to calculate the error rate,
device functions, which will be described in the next section, are used. After calculating the
error rate for each individual, fitness value (error rate) is written to related fitness array
index. So the aim is to decrease the fitness value (decreasing the error rate).
The most important part of the application is the kernel responsible in training of ANN. It is
the heart of the application where all the training process takes place.
The main steps are almost same with serial implementation but implemented in parallel
manner. At first, let`s see the signature of the kernel and how it is called from host side.
__global__ void evolvePopulation(float* individuals,ANNDATA
*training_data, int trainDataSize, Rand48 *random, float
*fitnessVector, Chrosomome res)
Random is the random number generator used for CUDA and will be described in detail. The
best chromosome is returned in “res” array.
As a design decision, training data is cached in the shared memory for fast access. Limited
shared memory storage is a restriction however; we assume that training data will not bigger
than 16KB. The advantage of Genetic algorithm based training is that, we don’t need large
size training data. So, there is no problem with that assumption to store training data in
shared memory. If we are using genetic algorithms for training purpose, we would better
take advantage of GA by using few training data sample.
extern __shared__ ANNData data[];
//cache trainig data to shared memory
if(threadId == 0){
for(int i = 0 ; i < trainDataSize ; i++)
data[i] = training_data[i];
}
__syncthreads();
Data array was decelerated as __shared__ since it will be stored in shared memory; it is
extern because we don’t know the size at compile time. After this declaration, the 0th thread
in each block is responsible for copying training data from global memory to shared memory.
After copy operation, barrier synchronization is used since we want to be sure training data
is stored in shared memory before starting to training process.
At this point, I want to point out a performance related term called “bank conflict”. Shared
memory in GPU divided into sections called “Bank”. The aim is to access these banks
simultaneously by different threads. However, if more than one thread wants to access same
bank, CUDA serializes and allow them to access that memory location one after another. It
decreases the parallel performance of the program as different threads try to access same
bank resides in shared memory. As a result, they cannot access the bank in parallel so they
access it in serial manner. (NVIDIA cuda programming guide)
If you look at training data fetching code, it can be seen easily that 0th threads of all blocks
try to access same training data element at the same time. This creates a bank conflict
described above. Although there are some methods to solve bank conflict problems, it was
not handled in the scope of this project. Same bank conflict problem resides while
calculating the error rate which will be discussed later.
To store training data in the shared memory provides us a low latency access to data. It is
utmost important as while calculating the fitness value, thread has to pass over all the
training data one by one and all the threads have to do same operation. Assume that our
block size is 256 and we have 128 training samples. It means 256 * 128 = 32768 times thread
access global memory. Let us say each global memory access takes about 600 cycle, 32768 *
600 = 19.660.800 cycle in order to process all training samples for a block. However, in our
approach, 0th thread will access the global memory 128 times which results 128 * 600 =
76800 cycle then for every access to shared memory will take about 3-4 cycle for thread
which means 256 * 3 = 768 cycle in total. So totally, 76800 + 768 = 77568 cycle when we use
shared memory. If we compare these results, approximately, shared memory solution will
run 253X times faster than just global memory solution.
Fitness array is also cached in shared memory for fast access. Instead of all fitness arrays, just
related part of it stored in shared memory as shown below, (let us assume that block size is 5
for easy presentation)
Fitness Array
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ..
Block 0 Block 1 Block 2
__shared__ float s_fitVec[BLOCK_SIZE];
s_fitVec[threadId] = fitnessVector[index];
__syncthreads();
Each thread in block is responsible in fetching the related fitness value and copying to the
shared memory. As you can see global fitness array is indexed with unique thread id
however, shared fitness vector is indexed with local thread id in block as we keep just the
related part of global fitness array.
After caching kernel related data to shared memory, we can start our evolution process.
(Mutation, crossover and selection)
First of all, three random numbers are generated to select different individuals in population
for mutation operation. These random numbers are used as indices to select individuals.
After selecting individuals, mutation function is called which will be explained in next section.
Mutate operator creates a mutated vector.
The next operation is crossover operation. The way It is done described in serial
implementation as well. However, instead of generating random numbers during crossover
operation, they are generated before and passed to crossover function as a parameter. After
this operation, we have an evolved individual. So, the next step is selection which means
checking whether it reduces fitness value or not. It is done with following code snap;
float result[OUTPUT_NUM];
float err = 0.0f;
for(int i = 0 ; i < trainDataSize ; i++){
feedForward(crossVec,data[i].input,result);
err += getANNerror(result,data[i].output);
}
if(err < s_fitVec[threadId]){
s_fitVec[threadId] = err;
for(int indx = 0 ; indx < CONNECTION_NUM ; indx++){
individuals[index * CONNECTION_NUM + indx] = crossVec[indx];
}
}
If it is decided that new vector reduce the error rate, it is written to fitness vector stored in
shared memory and new individual is replaced with old one. Finally, each thread in block
copies its fitness value from shared memory to global memory. So, one cycle for evolution is
completed.
For CUDA side random number generation, a linear congruential generator has been used.
The source code was gathered from http://www.amolf.nl/~vanmeel/mdgpu/download.html. It is
created on host size, initialized by a seed than passed to kernel as a parameter. In kernel
code, before generating random number, the state is loaded first, then numbers are
generated then the state is stored again. The sample code in terms of code generation is
shown below,
rand48_loadState(rng);
rand48_nextFloat(rng);
rand48_storeState(rng);
4-Device Functions Design
Device functions called from kernel are showed below,
__device__ void feedForward(Chrosomome chr, float* data,
float *result)
Described in previous section
__device__ float getANNerror(float* actual, float
*desired)
Calculates the error rate between actual and desired values and returns the error value.
__device__ void mutate(float *r1, float *r2, float
*r3,float *mutVec)
Creates a mutated vector according to, = +µ( – ) formula and
returns mutated vector in mutVec pointer.
__device__ void crossover(float *original, float *trial,
float *randomNumbers, float *crossed)
Makes crossover between original and trial vectors by using randomNumbers and
returns the result vector in crossed pointer.
__device__ void findMinFitness(float* fitnessVec, float
*result)
Scans the fitness array and finds the minimum fitness value and returns it in result.
4. Results
In order to test our CUDA application, sinus function has been used as benchmark since it is a
common benchmark method for ANNs and easy to generate its training data. (Find sin
benchmark sources)
For testing purpose, 5 different training set have been created between 0 and 90 degree.
These training data have been sampled for each 0.25, 0.5, 1, 1.5 and 2.5 degrees.
To approximate sinus function, 3 layers ANN has been used, 1 input layer, 1 hidden layer and
1 output layer. Besides, hidden layer consists of 6 neurons.
Default mutation rate is 0.2 and default crossover rate is 0.6. Beside, the performance is
measured for different population sizes. While comparing performance between serial and
parallel implementation, same parameters have been used.
5. Conclusion
6. Future Work
References
Bibliography
GA - Genetic Algorithms
DA – Differential Evolution
ANN - Artificial Neural Network
GPU - Graphic Processing Unit
CUDA - Compute Unified Device Architecture
CPU - Central Processing Unit
SIMD - Single instruction Multiple Data
HLSL - High Level Shading Language
Cg - C for Graphics