Dissertation by linzhengnd


									     School of Computing and Creative Technologies

         M.Sc. Computer Games Technology

Artificial Neural Network Implementation
Genetic Algorithms on GPU using CUDA

                   Emre Caglar

             University of Abertay Dundee

                  December 2009
1. Introduction

  1.1 Artificial Neural Networks and Genetic Algorithms on GPU

   It is one of the well known facts that, in recent years, artificial intelligence for games has
been more and more important. Considering games that have been produced in recent
years, it is clear that those games look incredible in terms of reality. However, not only is
realistic graphics make the game perfect, but also game developers have to support the
games with so-called intelligence. So we can call a game “so realistic”, if we can succeed to
combine both realistic graphics and artificial intelligence at the same time. The racing game
Colin McRae's Rally 2 is one of the sample that ANN is used to controlled opponent cars in

Artificial neural networks are one of the most effective AI techniques which are capable of
classification, approximation, learning, decision making and controlling. As a controller and
decision maker, artificial neural networks are widely used in games and its success has been
proved. The biggest constraint with neural networks is the essentially of training process.
Training data is one of the most important factors in terms of weight optimisation process of
neural network. It is relatively expensive operation especially if the training data is large. The
most common technique used to train ANN is the “Back-propagation” algorithm. Basically, it
feeds the neural network with training data and calculates the output. Then, it compares the
output and desired output and calculates the error rate. Finally, it back propagates the error
rate starting from the output layer to input layer and adjust the weights according to error
rate. Despite the fact that Back-Propagation algorithm is the most widely used training
algorithm, it has several disadvantages such that, it can stuck in local minima points, the
transfer function is supposed to be differentiable and it is difficult to parallelize. Moreover, it
is also so time consuming to back propagate error rate especially if the network has lots of
hidden layers and nodes.

Alike artificial neural networks, genetic algorithms are also based on biologic structure of
human and genetic algorithms are also so powerful to solve complex problems in a short and
efficient way. Genetic algorithms have been used to optimise solution space. Doing this, it
uses same principles with biologic genetic process. Firstly, some possible solutions related to
problem are coded in Chromosomes. The individuals carrying these chromosomes compose
the population. Population size and chromosome lengths are predefined values specific to
problem. After initialising initial population, a couple of genetic operators are applied to
chromosomes such as mutation and crossover. The aim is to provide divergency for solution
space. Mutation is made simple changing some bits or values of the chromosomes randomly.
As a result of this process, a new individual is generated. Another operator in terms of
genetic programming is crossover. The most common technique for crossover is to select
two parents and by changing some parts of genes and creating a new individual as a result.
Although crossover and mutation operators are the most common operators used, there is
lots of different implementation techniques for those operators. After mutation and
crossover, we calculate the fitness values of the individuals. The aim is to select individuals
which adapt the environment best. The fitness function is so important for evolution process
since it decided which individuals (solutions) are the best to solve the problem in question.
There are several selection techniques as well. One of them is selection %50 of whole
population whose fitness values are closer to 1 if we consider 1 is the perfect solution.
Another selection technique is that, we choose two random individuals and we select the
one whose fitness value is bigger than other and we eliminate the weak one. This solution is
also provide a opportunity that some weak individuals can also transfer their genes to next
population. It provides us better divergence. However, the method used in this project is
none of them :)

As i mentioned earlier, genetic algorithms are perfect for optimising. So why not to use GAs
to optimise weight values in a neural network. Although back-propagation is the most used
techniques to train neural networks, genetic algorithms became more and more popular to
optimise weights in recent years and some successful results obtained. It has some
advantages as well. Firstly, we do not have to use a differentiable transfer function in ANN
(with back-propagation, while updating weights, we are using the differential of transfer
function).Moreover, GA base solution does not stuck in local minima since it tries to find the
optimum solution in a large solution space using hundreds of chromosome. Lastly, easy to
implement as parallel which we are interested in this project.

Although, even they produce efficient solutions when ANN and GAs are used together, it is
computationally expensive process to train network. Especially for large training data sets
and large populations, it is extremely time consuming operation to train network since each
individuals test the each training values in data set file and produce a error value for each of
training set and then average the error value. Suppose we have 10.000 training data and for
each of them, training process will load the inputs to NN, will get the output value of NN and
will compare it with desired output. Then a fitness value will be calculated depending on
error value. If we have a population with 1000 individual, you can estimate how expensive
the training process is. That is why we have to parallelize it to get efficient results in an
acceptable time interval.
GPUs are very powerful parallel processors nowadays and they offer a huge computation
power and bandwidth. Their architecture is quite different compared to CPU as CPUs are
specialised to execute code in a serial manner, so the architecture is appropriate data
caching and flow control. For that reason, they have just a couple processors unlike GPU.
Unlike CPUs, GPUs are designed as SIMD processors in order to process a huge amount of
data simultaneously. GPU units have started to be used for general purpose computing since
a couple of years. However, programmers had lots of difficulties while implementing code
for GPU as they had to know very strong computer graphics background besides some low
level shader languages such as HLSL or Cg. Moreover, GPUs are not suitable to solve every
kind of problem so; programmer had to port their problem to GPU considering its
architecture and constraints. Beside all these problems, the biggest problem of GPU
programming was the lack of accessing the memory (read/write) randomly. It was possible
to read memory locations; however, it was not possible to write data to memory locations
belongs to graphic cards. So, programmer needed to make some trick in order to save their
data like using textures as memory locations for their data.

All these were problem until the time NVIDIA announced CUDA technology. CUDA abstracts
all the underlying problems defined previously, and offers a general computing framework
for GPUs. Thanks to CUDA, programmers do not have to know shader languages and strong
computer graphics background. Besides, programmers don’t need to know 3D graphics APIs
as well. Moreover, programmers have access to memory for both reading and writing. But
the best thing about CUDA is that programmer can write programs that will run on GPU with
a language look alike C. With CUDA, thousands of threads can run simultaneously. Threads
are grouped in blocks and blocks compose the grid. In order to make mapping algorithms to
GPU easier, threads in blocks can be organised as 1D, 2D or 3D. Like blocks, grid can be
organised 1D or 2D for easy access. These are all specific parameters for specific problems.
Programmer should choose the most efficient parameters for the problem.

1.1 Dissertation Structure


2. Literature Review

  2.1 Artificial Neural Networks

  Artificial Neural Networks (ANN) have their origin in the attempt to simulate by
mathematical means an idealised form of the elementary processing units in the brain and of
their interconnections, signal processing, and self-organisation capabilities. (Soft Computing,
A Tettamanzi, M. Tomassini, p:51)

Artificial Neural Network (ANN) is simple a pattern recognition approach. It is a kind of
model of biological neural network in our brain and tries to simulate processing and learning
approach of brain. So basically, an ANN consists of neurons and each neuron connected
other neuron groups. The connection between those neurons is extremely important
because the stronger the connection is, the more active neuron is to learn and process
ability. A simple brain neuron structure is shown in Figure 1 below.
So how can we model a biological neuron in order to create a artificial network on
computer? We can represent a neural network as groups of artificial neurons grouped in
layers. First layer of ANN represent the input layer and last layer represent the output layer.
Although there are several different kinds of neural network topologies such as recurrent
networks and self-organizing networks, only feed-forward neural networks will be discussed
in this section as it is used in this project.

Between input and output layer, there can be exist several number of hidden layers. Each
neuron is connected to all the neurons in next layer. Beside these, every neuron except input
layer has an additional input called bias. We will focus on the multilayer, feed-forward neural
network for the implementation. This type of network is quite versatile and is capable of
handling a wide variety of problems. So a sample ANN structure is shown below,
                          Level 1                     Level 2                 Level 3



                Input 0    0                        B

                                                        1                 B

        Output 0           1                        B
               Input 1

1                          2                        B
                Input 2

                                                        3                     B
2                                                                                 2

                Input 3                             B



We inject our input to the ANN by using input layer. (Layer 1 in above figure) Choosing
appropriate inputs for ANN is extremely important because a large scale of input data may
cause unwanted results or so narrow input set might not be represent correct data set in
terms of our problem. Input data format is also important. We cannot use “char” type or
“Boolean” type as an input type since ANN can process only real numbers. So somehow we
have to represent our pattern as a number sequence related to our problem. Moreover, it
has lots of advantages if we use normalized input values for ANN like between 0.0-1.0 or 1.0-

    Another question is the inner structure and how ANN works.
             Input 0                              bias
                                             Bias weight

             Input 1      W
                          1                          ∑

             Input 2
                                          Transfer function          activation function
             Input 3

     Suppose that we have picked a random neuron from the network and there are some
input values from other neurons and weights between the current neuron and other
neurons. In addition to inputs, each neuron has a bias input and associated weight. What
actually bias value does is that it shifts the sum of the weighted input along horizontal x axis.
It provides us a kind of flexibility that we can change the threshold using bias value. Most
probably bias value is chosen 1 or -1. So the total input to a neuron is,

    Neuron x = ∑ inputi x wi + (bias x bias weight)

    Then we use the output of this formula to calculate the out of the neuron by using a
activation function. An activation function simply gets the net output of the neuron,
operates on it and calculates the output of the neuron. The most commonly used activation
function is sigmoid function or logistic function. We will choose sigmoid function for the rest
of the examples. The formula of the sigmoid function is, (AI for game developers)

    F(x) = 1 / (1 + e-x)
    As you can see the output of the sigmoid function is always between the interval of 0.0
and 1.0 actually, the results never reach 0 and 1 values. At this point we have to define a
threshold value that decides to activate or deactivate the neuron. Suppose we have chosen
“0.5” as our threshold value. Then when sigmoid function produces a value bigger than 0.5
activates the neuron, otherwise, the neuron is deactivated. Moreover, you can say that if
sigmoid function produces 0.9 we activate the neuron and if it produces 0.1 we deactivate
the neuron. What I mean is that, we can change the active-deactivate behaviour by adjusting
the threshold value.

     Actually, in order to get meaningful result from our ANN we have to train it. It can be
said that it is also the most difficult part of implementing a neural network. There are also
different kinds of training algorithms for neural networks. They can be categorized under the
name of supervised and unsupervised training. However, the most commonly used training
algorithm is back-propagation algorithm so I would like to explain the method briefly.

    Training process of an ANN actually is an optimization process. To start training process,
we set random weights to neurons. After this initialization, we apply our training input to the
network and calculate the output generated by neuron. The next step is comparing the
generated output and the real output we want to obtain. The difference between those two
values provides us to calculate an error value. The most commonly used error evolution
formula is mean-square error. We can calculate that value with below formula, (AI for game

    ε = ∑ (nc - nd)2 / m

   While nc is the calculated value and nd is the desired value and m is the number of output
neurons for each epoch.
         According to error value, we adjust the neuron weights to be able to obtain wanted
    output. This adjusting process can depend on different techniques or formulas. We will use
    the below formula to adjust weights

        ∆w = ρ δi ni

    Ρ is the learning rate, δi is the error value we obtained and ni is the value of neuron being
    considered. Then we can change the weight value simply adding this value to the old weight

    Although, it is a well know method and widely used algorithm by several applications, Bask-
    propagation algorithm has a couple of drawbacks such as local and global minima, learning
    topologies and generalization and over fitting. (Soft Computing, Integrating Evolutionary,
    Neural, and Fuzzy systems, Chapter:2 p: 63,64,65)

   Local & Global Minima

    Because of back-propagation is a gradient-descent search algorithm, it finds the closest local
    minimum with respect to the starting point of the search. (. (Soft Computing, Integrating
    Evolutionary, Neural, and Fuzzy systems, Chapter:2 p:,64) So, considering search space is
    multi-dimensional and has lots of local minimum point, most probably back-propagation
    algorithm will stuck in a local minimum point as shown below.

    Although there are some techniques to come over this problem like “adding momentum”,
    (AI for game developers, chapter 14 p: 288) there is not any perfect solution for this problem.

   Learning Topologies

    Back propagation or similar algorithms adjust weights for fixed feed forward networks.
    However, it is clear that the interconnection of the units and their number plays an
    important role and lack of knowledge in determining the appropriate topology for a
    problem, including the number of layers; the number of neurons per layer often results in
    slow learning speed and poor performance. (Soft Computing, Integrating Evolutionary,
    Neural, and Fuzzy systems, Chapter:2 p:,64-65) )

   Generalization and over fitting

    If our network adapts to training data so well, the learning error rate will be too small,
    however, the test error rate will be high as it won’t be able to approximate the given
    function to produce desired output. In other words, network starts to learn input and
    outputs instead of approximate them. (Soft Computing, Integrating Evolutionary, Neural, and
    Fuzzy systems, Chapter:2 p:,65-66) )

    Due to the problems discussed above, evolutionary algorithms are started to be used in
    order to come over these problems and successful results have been obtained. Genetic
    algorithms and genetic algorithms based network training & weights optimization will be
    discussed in following chapter.

    So how good are multi-layered feed-forward networks? In order to answer this question, we
    have to figure out to which parameters it depends.

   The learning algorithm and the number of iterations
   The number of learning samples in order to decide how good the training data represent the
    actual function
   The number of hidden units

    The number of learning sample is one of the important parameter to measure how good the
    network is. (Instruction to neural networks, Ben Krose, Patrick van der smagt, November
    1996) Krose and Smagt use a sample function to testing the neural network. Y= f(x).They use
    two different number of learning sample to compare result. First, they use 4 learning sample
    and then they use 20 learning sample. With 4 learning sample, the error Elearning is small,
    however Etest is large. Since the learning samples are not enough to approximate the
    function to the desired output value. With 20 learning sample, the test error rate decreases
while leaning error increases. The graphics shown below show this two different cases and
how good the neural network approximates the function.

The dashed line gives the desired function, the learning samples are depicted as circles and
the approximation by the network is shown by the drawn line. A) 4 learning samples B) 20
learning samples.

For another test, the same function is used but different hidden unit numbers have been
used. Firstly, network has been tested with 5 hidden units then tested with 20 hidden units.

With 20 hidden units, the effect called overtraining can be seen easily from the results. The
network fits exactly with the learning samples, but because of the large number of hidden
units, the function which is actually represented by the network is far wilder than the
original one.

The dashed line gives the desired function, the circles donate the learning samples and the
drawn line gives the approximation by the function.12 learning sample is used. A) 5 hidden
units B) 20 hidden units
“The example shows that a large number of hidden units leads to small error on the training
set but not necessarily leads to a small error on the test set. Adding hidden units will always
lead to a reduction of the learning error rate. However, adding hidden units will first lead to
a reduction of the test error rate but then lead to an increase of test error rate as shown
below.” (Instruction to neural networks, Ben Krose, Patrick van der smagt, november 1996)

  2.2 Genetic Algorithms

 Genetic algorithms can be used for any problem where an optimisation process takes place.
Possible solutions are coded as individuals and they evolve over time by adapting themselves
to environment and the ones which do not fit are eliminated.

An evolutionary algorithm maintains a population of candidate solutions for the problem at
hand, and makes it evolve by iteratively applying (usually quite small) set of stochastic
operators, known as mutation, recombination (crossover) and selection. (Soft Computing,
Integrating Evolutionary, Neural, and Fuzzy systems, Chapter:1 p:,2)

A sample evolution process is shown below,
     First Generation

     Rank Fitness



Taken from AI for game developers

Since I have intended to use genetic algorithms to optimize just weights in a neural network,
I will not mention about how to encode the ANN architecture and topology into a
chromosome. So, I will discuss about encoding strategies of weights briefly.

For weight based representation, the first basic decision should be made about which
method will be used between bit strings representation and real value representation.

One of the approaches for representing the connection weights is using bit strings. In this
approach each weight value would be represented as a binary string and then decoded into
real values between, say, -10, 10 or -1, 1. This type of representation has some drawbacks.
When the number of connections increase, the length of the binary strings increase to a size
which causes to slower evolution process.
(Evolutionary_algorithms_for_neural_network_design_and_training.pdf, Soft Computing,
Integrating Evolutionary, Neural, and Fuzzy systems, Chapter: 4 section: 4-2 p: 125)

                                                 4                                    ...
                I0                                            5

                I1                                       -2
                                                 6                                    ...

0   0   0   0   0    1   0   0   0   0   0   0       0   1    1   0   0   0   0   0         0   1   0   1   1   0   0   0   0   0   1   0

|---------------4--------------|--------------6------------------|-----------------5-----------------|-------------- -2

In this sample, each weight value is represented with 8 bit and first bit is used as a sign bit.
So with 7 bit, we can encode the weights between -127, 127. It is suggested in (Combining
Genetic Algorithms and Neural Networks the Encoding Problem) that, one extra bit can be used to
encode if the connection is exists at all or not.

Another approach to encode weights is to use real value representation. Instead of bit
strings, weights are coded directly as real values. Most of the current genetic algorithm
weight optimization methods use real value representation.

For same network illustrated above, the real value representation is shown below;
                                   4.2                      ...
               I0                             5.4

               I1                          -2.0
                                    6.1                     ...

         4.2                      6.1                       5.4                     -2.0

Real values can be implemented as integers or floats and they are usually kept in a certain
interval. When initialising the first population and assigning random values to chromosomes,
usually a pre-defined interval is chosen and numbers are generated randomly in that
interval. Usually, weight values are assigned between the interval of -10, 10 or -20, 20.
(Neural network weight training by mutation, USING A GENETIC ALGORITHM FOR

The other important answer that is supposed to be answered is in which order weights are
supposed to code into the chromosome. If you consider crossover operator (which will be
discussed in following sections in detail), it is more likely that the genes far from each other
are disrupted compared to close ones. See the following figure,

Gene1    Gene2      Gene3    Gene4        Gene5     Gene6   Gene7    Gene8     Gene9       Gene10

                                                    Cut point

As you can see it is more likely that gene2 and gene9 will be disrupted compared to gene2
and gene3. Because of this feature, it is recommended that to organize genes tightly which
are functionally close. In other words, it is recommended to organize all the incoming and
outgoing weights of a hidden node next to each other.

Selection of the encoding model strongly depends on the problem we are trying to optimize.
However, we should keep in mind that both binary strings and real value encoding have
some drawbacks. One of the drawback in terms of bit strings is that if the network structure
gets more complex, then chromosome lengths will be too much which result the slow
evolution process as mentioned above. However, the drawbacks of real value code are
utmost important.

One of the problems with real value implementation is that mutation is the only way to
change weight values. What I mean is that, changing the values between two chromosomes
only results another weight combination of parent chromosomes. However, considering bit
string representation, changing bits between chromosomes results a different weights
combination. So, this is one of the weaknesses of real time implementation. (Training of
Neural Networks by means of Genetic Algorithms Working on very long Chromosomes)

//CAN BE EXPRESSED THE REDUNCANSIES in this section as well


Mutation operator simply changes the value of some genes in a chromosome randomly. For
bit string encoded chromosomes, mutation accurse by flipping some random bits. Mutation
rate in evolution process decides the probability of mutation in a population. Suppose if the
mutation rate is 2%, it means two individuals in a population with size 100 will be mutated.
See the diagram below to see the effect of mutation for a bit string represented

0                   0                 1                        0             1

0                   1                 1                        0             0

Chromosome before mutation: 00101
Chromosome after mutation: 01100

It is bit different when real value representation is used. A mutation operator adds or
subtracts an integer and shifts the weight value in a small amount. The value that will be
added or subtracted can be chosen in an interval (1...α) where α is the amount of mutation.


Crossover operator creates a new individual by selecting genes from parents randomly.
There are different kind of cross over implementation but all of them rely same principles.
First check how crossover operator process on bit string based chromosomes.

                  Parent 1                                Parent2

        0     0         1    1   0        1       1        0       0   0

                                 0    0       1       0        0
Above figure shows one-point cut crossover. There exists two or three points cross over
operators as well. By cross over operator, parents change their genetic materials and they
create a new individual as a result.

Fitness Function

Selecting fitness function is one of the most important decision on Genetic Algorithm since it
decide how well the solution for the specified problem. Besides, it id decided whether a
chromosome will transfer its genetic material to next population or not depending on the
chromosome`s fitness value.


Briefly, selection means the fittest individual survives. Selection is made based on fitness
values of the chromosomes. The chromosome that has the bigger fitness value compared to
others is more likely that it will survive in next generation as well. A couple of different
selection methods exist such as tournament-based selection or roulette-wheel selection.

Tournament based selection simply selects two random individuals from population and it
saves the one with bigger fitness value and discards other one.

Roulette-wheel selection xxx

If we sum up all the steps, we can summarise the process with fallowing flowchart
So, lastly, what are the termination criteria while evolving the population? Both best fitness
value depending on the error rate and generations number is important when termination
criteria is concerned. Sometime reaching a certain error value cab terminates the progress or
reaching the maximum generation number as well. Moreover, sometimes both of them can
be used. It is totally up to problem domain.

Evolutionary Algorithms on ANN Training

Evolutionary algorithms have being used for years and successful results have being
obtained. (xxx)

The first step is to create initial population. Each chromosome composes of weight as much
as connection number in ANN. Each chromosome represents a possible solution set of
weights for ANN.

When training the neural network with the help of genetic algorithms, fitness values are
calculated according to error rate of ANN for a specific chromosome. The pseudo code has
shown below;

 for each chromosome in Population
       Load weights to ANN
       For each data set in training samples
               Load inputs to ANN
               Feed forward the input
               Calculate output
               Compare actual and desired output
               Cumulate the error value


After processing all the learning data samples, an error value is obtained for the
chromosome. We can use this error value as a fitness value of the chromosome. So, the
smaller fitness value is, the more successful the chromosome is to solve that problem

   2.3 Differential Evolution Method

Differential evolution is another kind of evolutionary programming method with some
differences in mutation and crossover operator.

In differential evolution method, new individuals are generated by the combination of other
chromosomes from the population. As mentioned in previous sections, mutation operator is
applied to chromosomes simply changing some random bits or adding/subtracting some
values. However, in differential evolution, two chromosome vectors are selected randomly;
their difference is calculated, after multiplying the difference with a mutation constant, we
add it to another randomly selected chromosome.

Let`s say w represent the chromosomes, g for generation and k for indexing the individual in
population. So, let`s say mutant vector    can be generated according to following rules;
      =          +µ(        –         )

      =         +µ(         –     )

      =         +µ(         –     )+µ(             –        )

Where            is the best individual of the gth generation and r1, r2 and r3 is mutually
different and different from the running index i. As you can see, differential evolution
method is in a form of combining other individuals and generating a trial chromosome as a
result. µ > 0 is a real parameter called mutation constant, which controls the amplification of
the difference between two individuals.

After generating mutated individual, crossover (or recombination) operator is applied to the
mutated individual. To decide which gene will be transferred to new individual, a random
number is generated between [0, 1]. If random number is less than a crossover rate, the
gene transferred from trial mutant vector, otherwise from target vector. This processes is
illustrated below,

Population size = 10
Suppose for chromosome 2,

Ch1       Ch2         Ch3       Ch4       Ch5          Ch6      Ch7         Ch8      Ch9    Ch10

                                                +µ(                         -                )

                                                                      Trial vector

                            [0, 1]

                                                Result vector

Figure DE evolution

After generating result vector, select it if its fitness value is less then original one. Otherwise,
keep the original one.
  2.4 CUDA


CUDA (Compute Unified Device Architecture) is a both hardware and software model which
allows us to run parallel implemented code on GPU without the need of mapping to any
graphics API. CUDA scales the programs to 100`s of cores and 1000`s of threads running
simultaneously. CUDA allows programmer to focus on parallel algorithm instead of the
mechanism of a parallel programming language or graphics APIs.

The program part which runs on GPU simultaneously is called “Kernel” in CUDA literature.
One kernel is executed on GPU at a time and many threads (1000s) execute each kernel.
There are important differences between CPU threads and GPU threads.

       CUDA threads are extremely lightweight which means very little creation overhead
        and instant switching.
       CUDA uses 1000s of threads to run efficiently. However, CPU can use only a few

In CUDA literature, CPU is called as “Host” and GPU is called as “Device”.


CUDA also provides a thread hierarchy in order to make kernel mapping to GPU easier.
Thread hierarchy is also defines which thread can reach which memory locations and also it
is used for thread synchronization.

Each thread runs the same kernel issued by host and has its own local memory and registers.

                                                 Local Memory

 Threads are batched in thread blocks. Each block can consist of up to 512 threads and
threads in the block can be organized as 1D, 2D or 3D. Each thread inside the same thread
block has access to shared memory and can cooperate via shared memory.

                                                         Shared Memory

                                Thread Block
Threads in each block can be synchronized however; threads in different blocks can be
synchronized. Moreover, threads in same block can cooperate via shared memory.

Each thread has given a unique thread id when the kernel is lunched. This thread id variable
is a built-in variable and can be assessed by x, y and z components as threads can be
organized as 1D, 2D or 3D.

Thread Block

  (0, 0)        (1, 0)        (2, 0)        (3, 0)

  (0, 1)        (1, 1)        (2, 1)         (3, 1)

  (0, 2)        (1, 2)        (2, 2)        (3, 2)

  (0, 3)        (1, 3)        (2, 3)         (3, 3)

As you can see, threads in a 2D block is identified with x and y components and can be
accessed via built-in variable threadIdx

Thread blocks are required to execute independently since they can be executed in any
order, in parallel or in serial. This independence requirement allows thread blocks to be
scheduled in any order across any number of cores.

Thread blocks compose “grid” and grids can be composed of 1D or 2D thread blocks as
shown below;
  Shared Memory     Shared Memory    Shared Memory      Shared Memory      Shared Memory

Each block is given a unique block id and can be accessed via blockIdx built-in variable. A grid
can consist of 232 blocks resulting in a total of 241 threads.

Block size also can be reached via blockDim variable. It is a 3D variable and to get the size of
each dimension, we can use x, y or z component of the variable.

There is no way to synchronise thread blocks. Only threads in same block can be synchronized. And
different blocks cannot use each other`s shared memory. Each thread has an access to global memory
at grid level.

                                                        Global Memory

Kernel1 <<Grid>>

Kernel 2<<Grid>>

Software model of CUDA highly coupled with hardware model. So, in order to understand
software model like thread hierarchy or memory hierarchy, it is better to have a look to
hardware model of CUDA.


Let us see the hardware architecture of NVIDIA G80 series GPU which is capable of running
CUDA as well.

                                                                                     TP     TP

                                                                                     TP     TP

                                                                                     TP     TP

                                                                                     TP     TP

TP = Thread Processor

As you can see G80 series GPUs has 16 Multi-processors which means 16x8 = 128 thread
processor. Each multiprocessor has 8 thread processor and each thread processor in same
multi=processor can access share memory of the multi-processor.

A grid is executed on the device by scheduling thread blocks onto the multiprocessors.
Thereby, each block is mapped to one multiprocessor. Multiple thread blocks can be mapped
onto the same multiprocessor and are executed concurrently. If multiple blocks are mapped
onto a multiprocessor, its recourses, such as register and shared memory, are split among
the mapped thread blocks. This limits the amount of thread blocks that can be mapped onto
the same multiprocessor. It is called occupancy the ratio of active threads on a
multiprocessor to the maximum number of threads supported by a multiprocessor.

 CUDA devices use several memory spaces which has its own characteristics that reflect their
usage in CUDA applications. These memory spaces include global, constant, shared, texture,
local and register as shown below,

  HOST                               DEVICE
                                        DRAM               GPU
                CPU                                                  Multi Processor
                                                                     Shared Memory

   DRAM          Chipset                Texture                  Multi Processor
                                                              Multi Processor

As you can see, host can read and write to Global memory, texture memory and constant
memory, however, constant memory and texture memory are read only memory spaces for
GPU. In the scope of this project, only global memory is used to transfer data to GPU.

As you can see, global, texture and local memory spaces reside in device e DRAM location,
they have the greatest access latency (although texture is cached), followed by constant
memory, registers and shared memory.

The various principles traits of the memory types are shown below,
Memory         Location        Cached          Access          Scope                  Lifetime
               on/off chip
Register       On              n/a             R/W             1 thread               Thread
Local          Off             No              R/W             1 thread               Thread
Shared         On              n/a             R/W             All threads in block   Block
Global         Off             No              R/W             All threads + host     Host
Constant       Off             Yes             R               All threads + host     Host
Texture        Off             Yes             R               All threads + host     Host


CUDA programs can be written either using C for CUDA or CUDA Driver API. C for CUDA is
selected in this project since it has C style syntax and easier to learn.

C for CUDA extends C and allows programmer to define functions, called “kernels” , and
executed N times in parallel by N different CUDA threads unlike traditional C functions.

A kernel is defined with a __global__ declaration specifier which means it will be called by
host. Kernel calls from host code is done by kernel name and parameters followed by <<<>>>
syntax to provide some attributes such as block size, grid size, shared memory size per block
etc... Beside kernel functions have to be “void” return type. A sample declaration and host
code showed below;

 void __global__ vectorAdd ( float* vec1, float* vec2, float* result

           //add operation

 int main (){
      vectorAdd<<<1, N>>>(A, B, C);


It invokes a kernel call with just one block in grid and N thread in block.

As I mentioned in previous sections, each thread is given a unique thread id which can be
accessed via built-in threadIdx variable in device code. threadIdx is a dim3 variable type
which is a built-in type in CUDA and has x, y and z components. Just for illustration, we can
improve the code show above with threadIdx variable,
    void __global__ vectorAdd ( float* vec1, float* vec2, float* result
         //add operation
         int thread_id = threadIdx.x;
         result[thread_id] = vec1[threadId] + vec2[threadId];

    int main (){
         vectorAdd<<<1, N>>>(A, B, C);


This kernel will be executed by several threads with different thread Ids. So that, each thread
will use thread id as an index and will access the variable using that index in arrays and will
perform the adding operation. See the diagram below in order to see how this adding
operation is made in parallel.

0       1   2   3   0       1   2   3               0     1     2       3

                    Thread id = 0                                   write


Most of the kernel codes use the unique thread Ids in order to access data in parallel. So it is
important to identify each thread correctly. To do this, we use built-in variables like
blockDim, blockIdx and threadIdx. Device memory has a serial and one dimensional memory
layout physically like other memory spaces like main memory in host. So that, 2D and 3D
thread ids are all virtually and just make the mapping easier for programmer. That is why,
thread ids are supposed to converted to unique thread ids in grid. The most common pattern
to find the unique thread id is blockIdx.x * blockDim.x + threadIdx.x
                            0                             1                              2

      0   1   2    3   4            0   1   2    3   4             0   1   2    3    4

    0     1   2    3    4          5 6      7    8    9            10 11 12 13        14

blockIdx.x * blockDim.x + threadIdx.x       blockDim.x = 5

Threads within the same block can cooperate among themselves by sharing data through
shared memory and synchronizing their execution to coordinate memory access.
Synchronizing among threads in same block is done via __syncthreads () function provided
by CUDA. This functions acts as a barrier at which all threads in the block must wait before
any is allowed to proceed.

Previously mentioned, host and device have separate memory spaces and data is transferred
from host to device via PCI bus. Host manages device memory, so CUDA provides some c like
functions to allocate, free and copy memory locations. The important point is that
programmer should keep in mind that pointers to memory locations are just numbers. That
is why; we cannot tell from the pointer value whether it shows a memory location on CPU or
GPU. So, dereferencing device pointer on host most probably crashes the program and same
for vice versa.

Functions provided by CUDA to manage device memory is easy to use as they look like C
memory operation functions. Mainly, there are three kinds of important memory managing
functions which are frequently used while developing a CUDA application. These are,

  cudaMalloc(void** pointer, size_t nBytes);

  cudaFree(void* pointer);

  cudaMemcpy(void* dest, void* src, size_t nBytes, enum
  cudaMemcpyKind direction);
enum cudaMemcpyKind can be any of the followings,

         cudaMemcpyHostToDevice
         cudaMemcpyDeviceYoHost
         cudaMemcpyDeviceToDevice

As you can see, CUDA interface provides memory copy operations between host and device,
device to host and device to device. The point which should be kept in mind when memory
operation functions are being used is that, programmer should be careful about which
pointer is a device pointer and which one is a host pointer. In order to make the
programming easier, there can be used a specific notation when declaring a variable such
that “h_” prefix for host side variables and “d_” prefix for device side variables. Throughout
the thesis, this notation will be used to manage pointer easier and prevent the memory

// Device code
__global__ void VecAdd(float* A, float* B, float* C)
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];

// Host code
int main()
    int N = ...;
    size_t size = N * sizeof(float);

     // Allocate input vectors h_A and h_B in host memory
     float* h_A = (float*)malloc(size);
     float* h_B = (float*)malloc(size);
    //initialise vectors

        // Allocate vectors in device memory
        float* d_A;
        cudaMalloc((void**)&d_A, size);
        float* d_B;
        cudaMalloc((void**)&d_B, size);
        float* d_C;
        cudaMalloc((void**)&d_C, size);

    // Copy vectors from host memory to device memory
     cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
     cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock – 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);

        // Copy result from device memory to host memory
        // h_C contains the result in host memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
     // Free device memory

Above code demonstrates the basics of memory operations such as allocate, copy and free.
First of all, memory is allocated on host side by “malloc” functions then device memory is
allocated by “cudaMalloc” CUDA functions and after, vectors are copied form host to device.
Finally kernel is lunched and pointers are freed. These steps are the very basic and common
patterns for memory operations while writing a CUDA application.

Until now, we have talked about allocating variables and one dimensional array. What
happens if we need 2D or 3D arrays on device? In this case, CUDA recommends to the
developers to use cudaMallocPitch () and cudaMalloc3D (). These functions are
recommended for allocation of 2D or 3D arrays as it makes sure that the allocation is
appropriately padded to meet the alignment requirements. The returned pitch must be used
to access array elements. The following code sample allocates a width x height 2D array of
floating-point values and shows how to loop over the array elements in device code:

// Host code
float* devPtr;
int pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float),
myKernel<<<100, 512>>>(devPtr, pitch);

// Device code
__global__ void myKernel(float* devPtr, int pitch)
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];

Kernels are C functions with some restrictions,
    Cannot access host memory
    Must have void return type
    No variable number of arguments (“varargs”)
    No recursion
    No static variables

The other thing about kernels is that the arguments are copied from host to device
As I previously mentioned, CUDA provides a language extending C language, so that CUDA
provides some function qualifiers to adapt factions to CUDA needs.
       __global__
            o Functions called from host and executed on device (kernel functions)
            o Must return void return type
       __device__
            o Functions called from device and run on device
            o Cannot be called from host code
       __host__
            o Functions called from host and executed on host (default)
            o __host__ and __device__ qualifiers can be combined to generate both CPU
               and GPU code

We have seen how to lunch kernels in previous code samples, however we haven’t
mentioned about what is the parameters of kernels in terms of GPU configurations.

Kernel<<<grid, block, stream, shared_memory>>>

       grid :                   grid dimension up to 2D (dim3 type)
       block:                   block dimension up to 3D (dim3 type)
       stream:                  stream ID (optional)
       shared_memory:           number of bytes per block (optional)

Until now, we have used some defined built-in variables in CUDA. These built-in variables can
be accesses in __global__ and __device__ functions. These variables are listed below,

       dim3 gridDim;
           o Dimensions of the grid in blocks (at most 2D)
       dim3 blockDim;
           o Dimensions of the block in threads
       dim3 blockIdx;
           o Block index within the grid
       dim3 threadIdx;
           o Thread index within the block

        Dim3 is a vector type extending int and can be accessed using x,y and z components.

There are also a couple of built-on variable qualifiers defined by CUDA framework which are;

       __device__: If a variable is identified with this qualifier, it means the variable is
        stored in global memory which is large and has high latency and no cache. These
        kinds of variables are allocated via cudaMalloc described above and accessible by all
       __shared__: These kinds of variables are stored in on-chip shared memory which has
        very low latency. They are specified by execution configuration or at compile time
        and accessible by all threads in same block.

Unqualified variables like scalars and built-in vector types are stored in registers and if they
do not fit in registers they are spills to local memory.

3. Methodology

Mainly, I have implemented the project in two parts. One of them is the serial
implementation of neural network trained by genetic algorithms on CPU and the other one is
on GPU using CUDA.


The first design decision while implementing CPU code was to implement it as low level as I
can do. Because, there is no need to implement hundreds of classes and to encapsulate
everything if it is possible to solve the problem easier and with less coding. Another thing is
that, it is easier to port low level code to GPU then object oriented design and coding.

Another reason for doing both serial and parallel implementation is that, before
implementing parallel, to understand the basics of neural networks and genetic algorithms.
Another thing is that, in order to measure performance between parallel and serial
implementation and to show how fast the parallel implementation is. As I propose that,
parallel implementation on CUDA is much faster than serial implementation which means,
most of the AI workload on CPU can be shifted to GPU and AI in games can be run faster on
GPU allowing CPU to run other stuff.

First of all, ANN composes of nodes and its connections. It is clear that, nodes can be
represented as a 2D array and can be identified by its layer number and node number in that
layer. Similarly, connections can be represented as a 3D array and a connection can be
identified as the connection between the given neuron in given layer and the given neuron in
previous layer.

These arrays could be defined static arrays with a size. However, it would waste memory
spaces as all the layers have not the same number of neuron and moreover, layer numbers
and neuron numbers should be able to given from an external source so that those numbers
are not static. Considering all of these, the best solution is to implement nodes and
connections as pointers. That is why “**” (pointer-to-pointer) and “***” pointer-to-pointer-
pointer are used to define neural network. You can see the diagrams defined below to see
how neural network is represented in memory using pointers.

This diagram shows how a 3 layer (3-4-2) neural network can be implemented by pointers.
Neurons are simple 2D dynamic allocated arrays in memory as we can identify each neuron
using two indices. Layer number and neuron number in that layer.
                                                                              X[0][0] = neuron a
      x                                                                       X[1][1] = neuron b


                                Pointer array
                              (Indicates layers)
                                                                                 Double arrays
                                                                                 (Indicates neurons)


Weights are represented as 3D arrays as we can identify weights using three parameters.
Layer number, neuron in that layer and neuron in previous layer.

***                  Pointer-to-pointer array ** (per layer)

                                                                       Pointer array * per neuron
                                                                       (for each pointer, allocate
                                                                       space as much as the neuron
                                                                       numbers in previous layer)

                                                                              We can index a as
                                    a                                         [2][0][2] means
                                                                              weights between
                                                                              the first neuron in
                                                                              layer 3(output layer)
                                                                              and neuron 3 in
                                                                              previous layer (layer

So why needed lots of classes like Neuron class, layer class, neural network class? Neurons
are just 2D arrays and weights are just 3D arrays.

Dynamic memory allocations were done according the fallowing code snap.

//allocate memory for weights
double *** m_neuronWeights = new double **[m_numberOfLayers];
for(int i = 1 ; i < m_numberOfLayers ; i++){
      m_neuronWeights[i] = new double*[m_numOfNeurons[i]];
for(int i = 1 ; i < m_numberOfLayers ; i++){
      for(int j = 0 ; j < m_numOfNeurons[i] ; j++)
      m_neuronWeights[i][j] = new double[m_numOfNeurons[i-1]+1];
Above code demonstrate the dynamic weight allocation for ANN. The important point that
should be considered is that, biases are not implemented as a separate node in layers,
instead of that, for each neuron an extra connection is added for bias value.

Artificial neural network related code has been encapsulated with an ArtificialNeuralNetwork
class which serves a couple of functions to process ANN. The important ones are;

void           feedForward               (double* input);
double         getOutput                 (int index);
double         getMeanSquareError        (double* target);
void           loadWights                (double*);
void           release();
inline        double sigmoid(double inValue);

Most of the functions are self-explanatory as their names indicate what they do. However, I
want to make it clear which methods are used for two of them. To calculate the error value
of the network, the following equation is used,


And as a transfer function sigmoid function is used;

                        F(x) =

So that, our neural network processes a continuous space between the interval (0, 1).Since it
is an open interval, which means function value never reaches the 0 and 1 values, we
interpret the values close enough to 1, as 1 such as 0.999999. We have to keep it in mind
since, if our neural network produces a Boolean result like true and false, and if we interpret
the result of the network like if it is 1 then true otherwise false, we make a important
mistake as it never reaches to values 0 and 1.

Release function simply frees the allocated resources for neural network. So that, before
terminating the program, release function of ANN should be called.

In serial implementation, all constants related to ANN and GA is stored in a header file. So
that ANN layers and neurons are defined in this header in order to use these values during
creation process of ANN.

These constants should be changed to create ANNs with different configurations. These
parameters are,

const   int   INPUT_NEURON_NUMBER = 1;
const   int   HIDDEN_LAYER_NUMBER = 2;
const   int   HIDDEN_LAYER1_NEURON = 3;
const   int   HIDDEN_LAYER2_NEURON = 3;
const   int   OUTPUT_NEURON_NUMBER = 1;
const std::string TRAINING_DATA_FILE_NAME = "sin_1_5.dat";

The most important function of ANN is feed Forward function as it is used to train network
as well as getting the real output related to an input. It simply, loads the input pattern to
input layer and then it feed forwards it with current weight values and loads the output to
output layer.

void ArtificialNeuralNetwork::feedForward(double *input){
      double sum;
      //first set input layer with input pattern
      for(int i = 0 ; i < m_numOfNeurons[0] ; i++){
            m_neuronValues[0][i] = input[i];


        //then feed forward it
        //for each layer in neural network
        for(int i = 1 ; i < m_numberOfLayers ; i++){

                //for each neuron on a particular layer
                for(int j = 0 ; j < m_numOfNeurons[i] ; j++){

                        sum = 0.0; //reset sum
                        for(int k = 0 ; k < m_numOfNeurons[i-1] ; k++){

                sum += m_neuronValues[i-1][k] * m_neuronWeights[i][j][k];
                      //add bias
                      sum +=m_neuronWeights[i][j][m_numOfNeurons[i-1]];

                        m_neuronValues[i][j] = sigmoid(sum);



Before explaining genetic algorithm based training, it is better to explain how to manage
(reading, parsing and formatting for training) training data. These files usually composes of a
couple of values (int, float or double) depending the input and output numbers. For instance,
to approximate sine function using ANN, we can use one neuron in input layer and one in
output layer. So that, a training data file can be meaningful containing two float numbers in
each line for one value whose sine value will be calculated and one value for the real sine
value of the input.

A sample training data file for above example can be,
//Input value                    output value
     0                           0
     2.5                         0.0436194
     5                           0.0871557
     7.5                         0.130526
     10                          0.173648
     12.5                        0.21644
     15                          0.258819
     17.5                        0.300706
     20                          0.34202
     22.5                        0.382683
     25                          0.422618
     27.5                        0.461749
     30                          0.5

These values should be read and given to ANN in an appropriate format. To do this, I have
implemented a generic data reader class which is capable of handling all kind of data (int,
float, and double) with any number. It is a template class and had two vectors inside it for
inputs and outputs. Moreover, it is a singleton class so that no need to be created an
instance in order to use. A sample usage of the class is showed below,

Vector<DataPair<double>> =

With get instance, we obtain a data set reader reference, and then we parse the file. Parse
file function is a generic function and can be used with different types. Currently, double and
int data types are supported as float can be handled as double. So, from function call, we can
understand that we will read a file containing double values. First attribute of the function
shows which file we will read, second number is how many of them input values, third one
shows how many of them output values, and finally we use an enum (TYPE::DOUBLE or
TYPE::INTEGER) value to indicate to which type we will cast the read string from file as we
can understand the type from template parameter. That is why we have to indicate what
type we are reading second time.

Return type of the function is vector<DataPair< {double OR int}>> DataPair is a structure
defined in DatasetReader header file which is composed of one input vector and one output


The most important part of ANN implementation is the training process. For training, back-
propagation algorithm is used widely, however, due to the drawbacks that I have mentioned
in Literature review section; I have decided to use genetic algorithms to optimize ANN
weights. GAs is also more appropriate to implement in parallel fashion.
Firstly, I have created the structure of chromosomes. A chromosome is simply a double array
which is composed of neural network weights. For the representation of chromosome
composed of weights, please see the figure below,

                                        4.2                       2.3
              I0                                 5.4

              I1                              -2.0
                                        6.1                       1.1

double*            4.2            6.1                5.4          -2.0            2.3             1.1

The first step is creating an initial population with random values. Maximum population
number that will be created defined in Constants header file mentioned previously. It is
suggested that, the population size should be 2X < P < 4X where X the total connection
number on neural network.

In order to generate uniform random numbers, an external number generator library has
been used. ( http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/random/ ) It is
said that;

“It passes ALL of the tests for random number generators and has a period of 2^144, is completely
portable (gives bit identical results on all machines with at least 24-bit mantissas in the floating point
 representation). The algorithm is a combination of a Fibonacci sequence (with lags of 97 and 33,
and operation "subtraction plus one, modulo one") and an "arithmetic sequence" (using

To use random number generator, firstly we have to initialize it. In order to initialise, two
seeds are given to generator. In the header file of generator, it is defined that,

“NOTE: The seed variables can have values between: 0 <= IJ <= 31328
0 <= KL <= 30081”

The initialization routine used in this project is shown below,

RandomInitialise (1802,19373);

After initialisation process, we can start to generate random numbers with following
double RandomGaussian(double,double);
int    RandomInt(int,int);
double RandomDouble(double,double);

So that, it is a high precision random number generator which will allow us to generate
balanced and uniform individuals. Initial population is utmost important since all the
operations such as crossover and mutation will process over these values.

In this step, we have our initial population with randomized values and we have our training
samples in memory and ready to train neural network. I have used sine function as a
benchmark function in order to train my ANN and see how efficient the implementation is.
So that, I have created 5 different training data sets with 0.25, 0.5, 1, 1.5 and 2.5 degree
differences between 0 and 90 degree. Each line has two values. One for degree and another
one is sine value of the degree. So, our neural network has to have one input neuron and
one output neuron. Training files have a “.dat” extension and have been named
sin_0_25.dat, sin_0_5.dat, sin_1_0.dat, sin_1_5.dat and sin_2_5.dat depending on their
sampling interval.

During evolution process each cycle has mutate, crossover and select phase. In every cycle,
generation number is increased. The termination condition of the evolution process depends
either reaching the maximum generation number or reaching the minimum desired error
rate. So the main cycle in terms of evolution process is,

while(g->getMinFitness () > DESIRED_ERROR_RATE && g-
>mGenerationNumber < MAX_GENERATION_NUMBER){
            g->cycle ();

As you can see desired error rate and max generation number values are constants and they
are defined in constants header file. Default values are 0.005 and 2000 for error rate and
max population number respectively. You can change the parameters in order to get
different results.

Cycle method is the main function where the evolution process is takes place. This methods
executes mutation, cross over and selection operators for each individuals. If new trial vector
decreases the error rate, it is accepted new individual, otherwise it is discarded. Please refer
to Figure DE Evolution.

All the functionality related to genetic training is encapsulated within “GeneticTraining”
class. This class also includes an ANN instance for training purpose. So all the parameters
related to genetic process and ANN should be passed to constructor. The constructor of
GeneticTraining class is shown below,

GeneticTraining(        int numOfLayers ,
                        int* neuronsNum ,
                        int MaxPopulation,
                        float mutRate ,
                        float crossRate );

numOfLayers : Total number of layers of ANN
neuronsNum: Indicates the neuron numbers in each layer
MaxPopulation: Indicates max number of population
mutRate:mutation rate that is used during mutation process
crossRate:crossover rate that is used during crossover process

For all these parameters, their default values are defined in constants header file. If any of
them is not defined in constructor, it means their default values will be used.

Now we can describe the steps used during genetic algorithms based training in detail. First
step is initializing the population and calculating the fitness values of the initial population.
Calculation fitness function is one of the most expensive operations during training process
as it calculates every individual`s fitness functions and for each of them it process ANN for
each entry in training dataset file. So, if we have 1000 individual and our training data has
5000 training samples, it means for every fitness calculation cycle, 5000 x 1000 = 5.000.000
calculations will be done which is so expensive.

To store fitness values, a double array with the same size of population is used, so, if our
fitness array is;

Double* fitness_array = new double[MAX_POPULATION_NUMBER];

Then fitness_array[25] stores the 25.th individual`s fitness value. So, after initialising the first
population with random weights, firstly their fitness values are stored and found the best
fittest one.

void             GeneticTraining::cycle (){

        Chromosome m;
        Chromosome c;
        Chromosome s;

        for(int i = 0 ; i < mPopulationSize ; i++){

                 m = mutate(i);
                 c = crossover(mChromosomes[i], m);
                 s = select(mChromosomes[i], c)
                 mChromosomes[i]= s;
        printFitness ();

In every cycle, for each individual, first mutation operator is applied and a trial vector is
generated as a result, then, crossover operator is applied to this trial vector and the original
individual and the one that has a lesser fitness value is selected and replaced with original
individual. Let us see each operator in more detail.

Firstly, mutate operator is applied to selected individual.
Chromosome      GeneticTraining::mutate(int index){

        if(index == bestIndividual) return mChromosomes[index];

        int r1,r2,r3;
              r1 = RandomInt (0,mPopulationSize-1);
              r2 = RandomInt (0,mPopulationSize-1);
              r3 = RandomInt (0,mPopulationSize-1);

        }while(r1 == r2 || r1 == r3 || r2 == r3 || r1 == index || r2 ==
        index || r3 == index);

        Chromosome diff = new double[mWeightConNum];

        for(int i = 0 ; i < mWeightConNum ; i++){
              diff[i] = mChromosomes[r1][i] + (MUTATION_RATE *
              (mChromosomes[r3][i] - mChromosomes[r2][i]));

        return diff;


For mutation operator,         =       +µ(        –    ) formula is used as it has been
described in Differential Evolution section.

So, three mutually exclusive individuals are selected which are also different from the
original individual. After, the formula shown above is used to calculate mutated trial vector.
Mutation rate is a constant defined in constants header file and default value is 0.2 as I have
realized that it is the optimum value depending on my experiments on ANN. After applying
mutation operator, crossover operator has been applied to trial individual.

crossover(Chromosome source, Chromosome trial){

        Chromosome      crossedTrialIndividual = new double[mWeightConNum];

        for(int i = 0 ; i < mWeightConNum ; i++){

                double ran = RandomDouble (0.0, 1.0);
                if(ran < CROSSOVER_RATE){
                       crossedTrialIndividual[i] = trial[i];
                       crossedTrialIndividual[i] = source[i];
        return crossedTrialIndividual;
For crossover operator, the result vector of mutation operator and original selected
individual are used. A random number between 0 and 1 is generated for each weight
component of the individuals. If generated number is less than CROSSOVER_RATE defined in
constants header, the weight in trial vector is transferred to new trial individual; otherwise,
the weight value from the original vector is chosen.

 O1                                                M1
 O2                                                M2

 O3        If generated           If generated
           number                 number isles     M4
           greater than           than
 O4        crossover              crossover rate
 O5        rate                                    M5

 O6                                                M7
Original                  T0                  mutated vector






1-Application Design

Each thread in CUDA application is responsible in evolution of each chromosome in
population. Evolution phase consist of selection other random chromosomes for mutation,
crossover and selection. Throughout selection phase, an error rate is calculated using ANN
and training data. So, each thread has to complete these steps for a single chromosome.

I have chosen 1D grid and block structure to run my application. The reason is that, our data
structures are usually 1D array so it is easier to map to 1D grid. Besides, if I had used data
structures like matrices or images, it would have been easier to map them 2D grids and
blocks. The number of blocks in a grid is decided by population number in GA. So, grid size is
determined by MAX_POPULATION / BLOCK_SIZE equation. Block size must be chosen such a
number that it can keep all the processors in GPU busy. So usually, 128 or 256 threads per
block are ideal for our application. (Find) For that reason, it is advised for our application,
population size should be chosen multiplies of BLOCK_SIZE.

So, the structure of application in hardware level is shown below,

                           0                             1                                      2



Chromosome0     Chromosome1     Chromosome2        Chromosome3   Chromosome4      .....

Chromocome array

There are three different kinds of configuration header files. These are,

        Cuda_Config
        Neural_Network_Config
        Genetic_Algorithm_Config

Cuda_Config defined the parameters for CUDA application. The only parameter which is
defined in cuda config file is BLOCK_SIZE parameter.

const int       BLOCK_SIZE       =       128;

Neural_Network_Config file defines the parameters related to ANN topology and training.
These are;

const    int    INPUT_NUM                =   1;
const    int    OUTPUT_NUM               =   1;
const    int    LAYER_NUM                =   3;
const    int    NEURON_NUM               =   8;
const    int    CONNECTION_NUM           =   19;

const int layerNums[LAYER_NUM] = {1,6,1};
“layerNums” array defines the neuron numbers in each layer. Besides there variables,
training data structure is defined in this header file as well.

Genetic_Algorithm_Config file defines the constant in terms of genetic algorithm (DE) used
in this project and their default values are shown below;

const int               MAX_POPULATION                           = 1024;
const float             MUTATION_RATE                            = 0.2f;
const float             CROSSOVER_RATE                           = 0.6f;

1-General Code Design

CPU based implementation of this project is important because of two important reasons.
First, it helped me to understand the basics of Artificial Neural Networks and Genetic
Algorithms especially Differential Evolution method. Secondly, it is a measurement tool to
compare the performance results between CPU and GPU implementation. So, while
implementing GPU based application, the approach was not changed at all.

Firstly, configuration files related to CUDA, ANN and GA were created. In these header files,
configuration attributes such as block size, mutation rate, crossover rate, layer number of
ANN, neuron numbers in each layer etc… exist.

For this thesis, a NVIDIA GeForce 9800 GT graphic card was used. Some of the important
device capabilities are shown below as they are highly coupled with implementation.

Device 0: "GeForce 9800 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.51 GHz
Concurrent copy and execution: Yes

As you see, there are 14 multiprocessors in GPU and each multi processor contains 8 stream
processors. So, it means, there exists 14 * 8 = 112 cores on 9800 GT. For each
multiprocessor, we have 16 KB shared memory storage which is quite fast compared to 512
MB global memory. These values are quite important for our implementation as we have to
consider these constraints when we design our kernels, blocks and thread numbers in each
block as well as caching the data in shared memory etc…
Before explaining the implementation details, I would like to give a brief explanation about
the code structure.

    extern "C" void train                                 __device__ functions

                     Calls                                          uses

    cppIntegration.cu                    Uses
    extern "C" void train                                 GPU_kernels.cu
    (Implementation)                                          (Kernels)


  Cuda_Config.h              Genetic_Algorithm_Config.h      Neural_Network_Config.h

The structure described above is the fundamental units of the program and does not include
all the structure such as random number generators, data file readers etc...

Main.cpp is the entry point of the program. It set up the configuration parameters, read the
training data and starts the training. Training function is an extern function which means it is
implemented in another source file. GPU_kernels.cu CUDA source file includes two kernels.
One of them calculates the first fitness values of initial population and another one makes
the training. During training process, kernel uses some functions related to ANN and GA
such as feed-forward, get error, mutate crossover and find minimum fitness functions which
are defined in GPU_device_functions.cu file.

After this brief information, we can start go into the details of implementation.

2-Data Structure Design

Firstly, to represent chromosomes, one dimensional array is used unlike CPU
implementation. The reason is that, CUDA does not support pointer-to-pointers directly so
you need to make some tricks like allocating a pointer array first then allocation arrays for
each pointers and then copying them or using cudaMalloc2D function. But since memory has
a linear storage space, all the different dimensions are represented as linear eventually.
Chromosome0                      Chromosome1             Chromosome2                      ........
 0      1     2      ...


Assuming that the name of chromosome array is chArr, In order to get a specific
chromosome, we can use the following formula,

Chromosome ch = & chArr [ chromosome_number * TOTAL_CONNECTION]

Beside the change in term of chromosome representation, training data format also has to
be changed in order to be run on GPU. To fit training samples to GPU, an ANNDATA structure
is used to store training values.

typedef struct ANNData{

         float input[INPUT_NUM];
         float output[OUTPUT_NUM];


INPUT_NUM and OUTPUT_NUM variables are the constants that represent input layer
number and output layer number. So, CUDA can decide the size of ANNDATA structure at
compile time which will make our implementation simpler.

The most difficult part of the CUDA implementation was the ANN implementation on CUDA
for me. The reason is the memory restrictions on GPU. As I mentioned in application design
section, each thread is responsible in also calculating the error rate which means you have to
feed forward input data and get the result from output nodes. To do that, each thread has to
run its own ANN instance which is a problem for CUDA. Because, we cannot store a different
ANN instance for each thread due to lack of memory. At the best case, we can think that we
can store an ANN instance in shared memory so that each thread in the same block can use
same instance for ANN process. It does not work due to several reasons. First of all, for a
large ANN topology, its size will be more than 16KB which is the amount of shared memory.
Maybe we can define an ANN structure less then 16KB with pointers, however, it is
forbidden to make dynamic memory allocation on shared memory so that our layers and
nodes should be constant size that is why we have to set up them to their max values in
order to provide flexibility for different topologies. Another reason is that, even we store an
ANN instance to shared memory, there will be synchronization problem. Assume that a
thread started to feed forward a data, at the same time another thread can start feed
forwarding as well and it can override the values from another thread. We can solve that
problem by defining a flag that shows whether ANN in use or not, however, ANN feed-
forwarding will be serialized and there will be no point for parallel implementation.

I have accomplished this problem with a very unique approach. If we think about the way
how ANN process, it is obvious that feed-forwarding is just the accumulated weight and
neuron value multiplication. To do that, the only thing we need is to know neuron values and
weights. We already know the weights for a ANN since they are stored in chromosomes. So,
storing the neuron values will be enough for our implementation. Storing the neuron values
in a array is not a big problem for us since, for a 1, 6, 1 ANN, our array size will be 8 * size of
(float) which is approximately 32 byte which means it is enough to store just neuron values
per thread to run a different ANN instance per thread. Let us see how it works,

float neurons[NEURON_NUM];
int chromosomeIndex = 0;

//load input data to input neurons

for(int i = 1 ; i < LAYER_NUM ; i++){ //for each layer

        //find total neurons in previous layers
        int preNeuronsNum = 0;
        for(int t = 0 ; t < i ; t++){
              preNeuronsNum += layerNums[t];

for(int j = preNeuronsNum, int index = 0 ; j < layerNums[i] +
preNeuronsNum , index < layerNums[i] ; j++, index++){

        float sum = 0;
        for(int k = layerNums[i-1] -1 ; k >=0 ; k--){
        sum += neurons[j - index - (k + 1)] * chr [chromosomeIndex++];
//add bias
sum += chr[chromosomeIndex++];

float calc = expf(-1 * sum);
calc += 1.0f;
calc = 1.0f / calc;

//store value
neurons[j] = calc;

//load output to output neurons

This is the all code which makes feed-forwarding on a ANN. As you can see, there are no
complex ANN data structures. We define just a neurons[NEURON_NUM] array to store
neuron values. For each layer, weight and neuron value multiplication is made for each
neuron and value is stored. Before progressing on layer, the number of neurons in previous
layers is calculated in order to index neuron value array. The conversion between 1D neuron
value array and real ANN topology is shown below,
        Layer 0                   Layer 1                      Layer 2
                                                                    Index = 0
                                  Index = 0
   Index = 0
                                                                      N 2, 0

                                  Index = 1

                                    N1, 1

                                   Index = 2

                          For N 1, 1                                 for N 2, 0

Previous neurons number + index                previous neurons number + index
               1+1=2                                4+0

0                 1                    2                   3                      4
Neurons values array

So that we can represent all neuron values in a 1D array and we can map topology and array
to each other. For instance, for output neuron, loop executes three times as there are three
neurons in previous level. So their values are fetched from neurons array and multiplied by
the weights which are fetched from chromosome. After that, bias value is added (bias weight
is also stored in chromosome) and finally sigmoid function calculates the final output and
store it to neuron. We have to store neuron values somehow as all the neurons except input
layer need the values of previous neurons.

In this method, ANN is represented as just values of products. It allows as running separate
ANN instance for each thread.

Actually, we can improve this method by storing the neuron values for just previous layer
since if we are calculating 3. Hidden layer neuron values, we don’t need to know the neuron
values in first layer as we have already used those values. So we can just use a array with size
of the biggest layer and for each layer calculation, we could update that array instead of
storing all the neurons in ANN.

Another data structure is fitness array. It is simply 1D array with the size of population and
stores the fitness value of the chromosomes by one-to-one basis. It means, if we want to
learn the fitness value of 5th chromosome, we should fetch the value with index 5 Index in
fitness array.
Besides, same data reader class which is used in serial implementation is also used in CUDA

3-Kernel Design

Two kernels are used within the applications which are to calculate the first fitness values
and to train ANN.

Firstly, let’s see the kernel responsible in calculating first fitness values of chromosomes.
Before that, chromosome array is created and initialized by random numbers between the
values -10, 10. Chromosome array consist of chromosomes which are simply the weight
values for the ANN. This initialization is made at host side and same number generator is
used with serial implementation

After, memory allocation is made at GPU memory for training data, initial chromosome
weights and other necessary parameters. After allocation process, all the values in question
are copied to memory locations which are allocation by “CudaMalloc”.

__global__ void calculateFirstFitness(float *individuals, ANNDATA
*trainData,int trainDatasize,float* fitnessVector)

The first parameter is chromosome array for weight values, train data is the structure for
training data, and fitness vector is the array into which weigh values will be stored. This
kernel is called from the host as;

calculateFirstFitness<<<MAX_POPULATION / BLOCK_SIZE, BLOCK_SIZE>>>

For each chromosome, a thread is created so that, each thread picks a chromosome up to
calculate its fitness with following code snap.

int blockId       = blockIdx.x;
int threadId      = threadIdx.x;
int index = blockId * blockDim.x + threadId;
Chrosomome individual = &individuals[CONNECTION_NUM * index];

It iterates over the all training samples and calculates the error rate same as in serial
implementation. To feed forward the input data through ANN and to calculate the error rate,
device functions, which will be described in the next section, are used. After calculating the
error rate for each individual, fitness value (error rate) is written to related fitness array
index. So the aim is to decrease the fitness value (decreasing the error rate).

The most important part of the application is the kernel responsible in training of ANN. It is
the heart of the application where all the training process takes place.

The main steps are almost same with serial implementation but implemented in parallel
manner. At first, let`s see the signature of the kernel and how it is called from host side.

__global__ void evolvePopulation(float* individuals,ANNDATA
*training_data, int trainDataSize, Rand48 *random, float
*fitnessVector, Chrosomome res)

Random is the random number generator used for CUDA and will be described in detail. The
best chromosome is returned in “res” array.

As a design decision, training data is cached in the shared memory for fast access. Limited
shared memory storage is a restriction however; we assume that training data will not bigger
than 16KB. The advantage of Genetic algorithm based training is that, we don’t need large
size training data. So, there is no problem with that assumption to store training data in
shared memory. If we are using genetic algorithms for training purpose, we would better
take advantage of GA by using few training data sample.

extern     __shared__ ANNData data[];

//cache trainig data to shared memory
if(threadId == 0){
      for(int i = 0 ; i < trainDataSize ; i++)
            data[i] = training_data[i];

Data array was decelerated as __shared__ since it will be stored in shared memory; it is
extern because we don’t know the size at compile time. After this declaration, the 0th thread
in each block is responsible for copying training data from global memory to shared memory.
After copy operation, barrier synchronization is used since we want to be sure training data
is stored in shared memory before starting to training process.

At this point, I want to point out a performance related term called “bank conflict”. Shared
memory in GPU divided into sections called “Bank”. The aim is to access these banks
simultaneously by different threads. However, if more than one thread wants to access same
bank, CUDA serializes and allow them to access that memory location one after another. It
decreases the parallel performance of the program as different threads try to access same
bank resides in shared memory. As a result, they cannot access the bank in parallel so they
access it in serial manner. (NVIDIA cuda programming guide)
If you look at training data fetching code, it can be seen easily that 0th threads of all blocks
try to access same training data element at the same time. This creates a bank conflict
described above. Although there are some methods to solve bank conflict problems, it was
not handled in the scope of this project. Same bank conflict problem resides while
calculating the error rate which will be discussed later.

To store training data in the shared memory provides us a low latency access to data. It is
utmost important as while calculating the fitness value, thread has to pass over all the
training data one by one and all the threads have to do same operation. Assume that our
block size is 256 and we have 128 training samples. It means 256 * 128 = 32768 times thread
access global memory. Let us say each global memory access takes about 600 cycle, 32768 *
600 = 19.660.800 cycle in order to process all training samples for a block. However, in our
approach, 0th thread will access the global memory 128 times which results 128 * 600 =
76800 cycle then for every access to shared memory will take about 3-4 cycle for thread
which means 256 * 3 = 768 cycle in total. So totally, 76800 + 768 = 77568 cycle when we use
shared memory. If we compare these results, approximately, shared memory solution will
run 253X times faster than just global memory solution.

Fitness array is also cached in shared memory for fast access. Instead of all fitness arrays, just
related part of it stored in shared memory as shown below, (let us assume that block size is 5
for easy presentation)

Fitness Array
0   1     2   3    4    5    6     7   8    9    10 11 12 13 14 15 16 17 18 ..

Block 0                  Block 1                  Block 2

__shared__ float s_fitVec[BLOCK_SIZE];

s_fitVec[threadId] = fitnessVector[index];

Each thread in block is responsible in fetching the related fitness value and copying to the
shared memory. As you can see global fitness array is indexed with unique thread id
however, shared fitness vector is indexed with local thread id in block as we keep just the
related part of global fitness array.
After caching kernel related data to shared memory, we can start our evolution process.
(Mutation, crossover and selection)

First of all, three random numbers are generated to select different individuals in population
for mutation operation. These random numbers are used as indices to select individuals.
After selecting individuals, mutation function is called which will be explained in next section.
Mutate operator creates a mutated vector.

The next operation is crossover operation. The way It is done described in serial
implementation as well. However, instead of generating random numbers during crossover
operation, they are generated before and passed to crossover function as a parameter. After
this operation, we have an evolved individual. So, the next step is selection which means
checking whether it reduces fitness value or not. It is done with following code snap;

float result[OUTPUT_NUM];
float err = 0.0f;
for(int i = 0 ; i < trainDataSize ; i++){
      err += getANNerror(result,data[i].output);

if(err < s_fitVec[threadId]){
      s_fitVec[threadId] = err;
      for(int indx = 0 ; indx < CONNECTION_NUM ; indx++){
        individuals[index * CONNECTION_NUM + indx] = crossVec[indx];

If it is decided that new vector reduce the error rate, it is written to fitness vector stored in
shared memory and new individual is replaced with old one. Finally, each thread in block
copies its fitness value from shared memory to global memory. So, one cycle for evolution is

For CUDA side random number generation, a linear congruential generator has been used.
The source code was gathered from http://www.amolf.nl/~vanmeel/mdgpu/download.html. It is
created on host size, initialized by a seed than passed to kernel as a parameter. In kernel
code, before generating random number, the state is loaded first, then numbers are
generated then the state is stored again. The sample code in terms of code generation is
shown below,

4-Device Functions Design

Device functions called from kernel are showed below,

       __device__ void feedForward(Chrosomome chr, float* data,
        float *result)

    Described in previous section

       __device__ float getANNerror(float* actual, float

    Calculates the error rate between actual and desired values and returns the error value.

       __device__ void mutate(float *r1, float *r2, float
        *r3,float *mutVec)

    Creates a mutated vector according to,        =       +µ(         –    ) formula and
    returns mutated vector in mutVec pointer.

       __device__ void crossover(float *original, float *trial,
        float *randomNumbers, float *crossed)

    Makes crossover between original and trial vectors by using randomNumbers and
    returns the result vector in crossed pointer.

       __device__ void findMinFitness(float* fitnessVec, float

    Scans the fitness array and finds the minimum fitness value and returns it in result.

4. Results

In order to test our CUDA application, sinus function has been used as benchmark since it is a
common benchmark method for ANNs and easy to generate its training data. (Find sin
benchmark sources)

For testing purpose, 5 different training set have been created between 0 and 90 degree.
These training data have been sampled for each 0.25, 0.5, 1, 1.5 and 2.5 degrees.

To approximate sinus function, 3 layers ANN has been used, 1 input layer, 1 hidden layer and
1 output layer. Besides, hidden layer consists of 6 neurons.

Default mutation rate is 0.2 and default crossover rate is 0.6. Beside, the performance is
measured for different population sizes. While comparing performance between serial and
parallel implementation, same parameters have been used.
5. Conclusion

6. Future Work



GA    - Genetic Algorithms

DA – Differential Evolution

ANN    - Artificial Neural Network

GPU    - Graphic Processing Unit

CUDA - Compute Unified Device Architecture

CPU    - Central Processing Unit

SIMD - Single instruction Multiple Data

HLSL - High Level Shading Language
Cg   - C for Graphics

To top