School of Computing and Creative Technologies



M.Sc. Computer Games Technology









Artificial Neural Network Implementation

Using

Genetic Algorithms on GPU using CUDA







Emre Caglar

0804329







University of Abertay Dundee



December 2009

1. Introduction





1.1 Artificial Neural Networks and Genetic Algorithms on GPU



It is well known that artificial intelligence has become increasingly important for games in recent years. Games produced in recent years look remarkably realistic. However, realistic graphics alone do not make a game complete; developers also have to support their games with convincing intelligence. We can call a game truly realistic only if it combines realistic graphics with artificial intelligence. The racing game Colin McRae Rally 2 is one example in which an ANN is used to control the opponent cars.









Artificial neural networks are among the most effective AI techniques and are capable of classification, approximation, learning, decision making and control. As controllers and decision makers, they are widely used in games and their success has been proven. The biggest constraint of neural networks is the necessity of a training process. Training data is one of the most important factors in optimising the weights of a neural network, and training is a relatively expensive operation, especially if the training data set is large. The most common technique used to train an ANN is the back-propagation algorithm. It feeds the neural network with training data and calculates the output. It then compares the output with the desired output and calculates the error. Finally, it propagates the error back from the output layer to the input layer and adjusts the weights accordingly. Although back-propagation is the most widely used training algorithm, it has several disadvantages: it can get stuck in local minima, the transfer function must be differentiable, and it is difficult to parallelise. Moreover, back-propagating the error is time consuming, especially if the network has many hidden layers and nodes.



Like artificial neural networks, genetic algorithms are inspired by biology, and they are powerful tools for solving complex problems efficiently. Genetic algorithms are used to search a solution space for an optimum, using the same principles as the biological genetic process. First, some possible solutions to the problem are encoded in chromosomes. The individuals carrying these chromosomes make up the population. Population size and chromosome length are predefined values specific to the problem. After the initial population has been created, genetic operators such as mutation and crossover are applied to the chromosomes. The aim is to provide diversity in the solution space. Mutation simply changes some bits or values of a chromosome at random, which produces a new individual. The other common operator is crossover. The most common crossover technique selects two parents and exchanges parts of their genes to create a new individual. Although crossover and mutation are the most common operators, there are many different implementation techniques for them. After mutation and crossover, we calculate the fitness values of the individuals. The aim is to select the individuals that are best adapted to the environment. The fitness function is very important for the evolution process, since it decides which individuals (solutions) are best at solving the problem in question. There are several selection techniques as well. One of them is to select the 50% of the population whose fitness values are closest to 1, if we consider 1 to be the perfect solution. Another technique is to choose two random individuals, keep the one with the larger fitness value and eliminate the weaker one. This technique also gives some weak individuals the opportunity to transfer their genes to the next population, which provides better diversity. However, the method used in this project is neither of these.



As I mentioned earlier, genetic algorithms are well suited to optimisation. So why not use GAs to optimise the weight values of a neural network? Although back-propagation is the most widely used technique for training neural networks, genetic algorithms have become more and more popular for weight optimisation in recent years, and some successful results have been obtained. The approach has several advantages. Firstly, we do not have to use a differentiable transfer function in the ANN (back-propagation uses the derivative of the transfer function when updating weights). Moreover, a GA-based solution is less likely to get stuck in local minima, since it searches for the optimum in a large solution space using hundreds of chromosomes. Lastly, it is easy to implement in parallel, which is the focus of this project.



Although ANNs and GAs produce effective solutions when used together, training the network is computationally expensive. Especially for large training data sets and large populations, training is extremely time consuming, since each individual tests every training sample in the data set, produces an error value for each sample and then averages the error values. Suppose we have 10,000 training samples. For each of them, the training process loads the inputs into the NN, gets the output of the NN and compares it with the desired output. A fitness value is then calculated from the error value. With a population of 1,000 individuals, this amounts to 10,000,000 network evaluations per generation. That is why we have to parallelise the process to get results in an acceptable time.

GPUs are now very powerful parallel processors that offer huge computational power and memory bandwidth. Their architecture is quite different from that of CPUs. CPUs are specialised for executing code serially, so their architecture is devoted to data caching and flow control, and they have only a few cores. GPUs, in contrast, are designed as SIMD processors in order to process a huge amount of data simultaneously. GPUs have been used for general purpose computing for several years. However, programmers had many difficulties implementing code for the GPU, as they needed a strong computer graphics background as well as knowledge of low-level shader languages such as HLSL or Cg. Moreover, GPUs are not suitable for every kind of problem, so programmers had to port their problem to the GPU with its architecture and constraints in mind. Beside all these problems, the biggest problem of GPU programming was the lack of random access (read/write) to memory. It was possible to read memory locations; however, it was not possible to write data to arbitrary memory locations on the graphics card. Programmers therefore needed tricks to save their data, such as using textures as memory locations.



All of these were problems until NVIDIA announced CUDA. CUDA abstracts away the underlying problems described above and offers a general computing framework for GPUs. Thanks to CUDA, programmers do not need to know shader languages or 3D graphics APIs, nor do they need a strong computer graphics background. Moreover, programmers have access to memory for both reading and writing. But the best thing about CUDA is that programmers can write programs that run on the GPU in a C-like language. With CUDA, thousands of threads can run simultaneously. Threads are grouped into blocks, and blocks make up the grid. To make mapping algorithms to the GPU easier, threads in a block can be organised in 1D, 2D or 3D. Similarly, the grid can be organised in 1D or 2D. These are all problem-specific parameters, and the programmer should choose the most efficient ones for the problem at hand.



1.2 Dissertation Structure









xxxxxxxxxxxxxxxx









2. Literature Review



2.1 Artificial Neural Networks



Artificial Neural Networks (ANN) have their origin in the attempt to simulate by

mathematical means an idealised form of the elementary processing units in the brain and of

their interconnections, signal processing, and self-organisation capabilities. (Soft Computing,

A Tettamanzi, M. Tomassini, p:51)



An Artificial Neural Network (ANN) is simply a pattern recognition approach. It is a model of the biological neural network in our brain and tries to simulate the brain's processing and learning approach. Basically, an ANN consists of neurons, and each neuron is connected to other groups of neurons. The connections between neurons are extremely important, because the stronger a connection is, the more it contributes to the neuron's ability to learn and process. A simple brain neuron structure is shown in Figure 1 below.

So how can we model a biological neuron in order to create an artificial network on a computer? We can represent a neural network as artificial neurons grouped in layers. The first layer of the ANN is the input layer and the last layer is the output layer. Although there are several kinds of neural network topologies, such as recurrent networks and self-organising networks, only feed-forward neural networks are discussed in this section, as this is the type used in this project.



Between the input and output layers there can be any number of hidden layers. Each neuron is connected to all the neurons in the next layer. In addition, every neuron except those in the input layer has an additional input called the bias. We will focus on the multilayer feed-forward neural network for the implementation. This type of network is quite versatile and is capable of handling a wide variety of problems. A sample ANN structure is shown below.


[Figure: sample feed-forward ANN with three levels. Level 1: Input 0 to Input 3; Level 2: hidden neurons 0 to 5; Level 3: Output 0 to Output 2. B marks a bias input.]









We feed our input to the ANN through the input layer (Level 1 in the figure above). Choosing appropriate inputs for the ANN is extremely important, because too large a range of input data may cause unwanted results, while too narrow an input set might not represent the data set correctly for our problem. The input data format is also important. We cannot use a “char” or “Boolean” type directly as an input, since an ANN can process only real numbers. We therefore have to represent our pattern as a sequence of numbers related to our problem. Moreover, it is advantageous to use normalised input values for the ANN, for example between 0.0 and 1.0 or between 1.0 and 100.0.
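As an illustration of the normalisation step, the following is a minimal Python sketch of min-max scaling into a chosen interval. The function name `normalise` and its parameters are hypothetical and are not taken from the project code.

```python
def normalise(values, low=0.0, high=1.0):
    """Scale a list of numbers linearly into the interval [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values are identical, so map everything to the lower bound.
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


raw_inputs = [10.0, 20.0, 30.0]
print(normalise(raw_inputs))              # scaled into 0.0-1.0
print(normalise(raw_inputs, 1.0, 100.0))  # scaled into 1.0-100.0
```

The same scaling has to be applied to any data presented to the network later, otherwise the inputs would no longer match the range used during training.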



Another question is the inner structure of a neuron and how the ANN works.

[Figure: a single neuron. Input 0 to Input 3, weighted by W0 to W3, and a bias input with its bias weight are summed (∑) and passed through the transfer (activation) function to give the activation.]









Suppose that we have picked a random neuron from the network, with input values arriving from other neurons and a weight on each connection between the current neuron and those neurons. In addition to these inputs, each neuron has a bias input and an associated weight. The bias shifts the weighted sum of the inputs along the horizontal x axis, which gives us the flexibility to change the threshold through the bias value. The bias value is usually chosen as 1 or -1. The total input to a neuron is:









$$\text{Neuron } x = \sum_{i=0}^{N} \text{input}_i \times w_i + (\text{bias} \times \text{bias weight})$$





We then use the result of this formula to calculate the output of the neuron with an activation function. An activation function simply takes the net input of the neuron, operates on it and produces the output of the neuron. The most commonly used activation function is the sigmoid (logistic) function, which we will use for the rest of the examples. The formula of the sigmoid function is (AI for Game Developers):





$$F(x) = \frac{1}{1 + e^{-x}}$$

As you can see, the output of the sigmoid function always lies between 0.0 and 1.0; in fact, it never reaches 0 or 1 exactly. At this point we have to define a threshold value that decides whether the neuron is activated or deactivated. Suppose we have chosen 0.5 as our threshold. Then the neuron is activated when the sigmoid function produces a value greater than 0.5; otherwise it is deactivated. Alternatively, we could activate the neuron when the sigmoid function produces 0.9 and deactivate it when it produces 0.1. In other words, we can change the activation behaviour by adjusting the threshold value.
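The neuron computation described above (weighted sum, bias, sigmoid and threshold) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the example inputs and the example weights are assumptions, not values from the project.

```python
import math


def sigmoid(x):
    """Logistic activation: maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias_weight, bias=1.0):
    """Weighted sum of the inputs plus the bias term, passed through the sigmoid."""
    net = sum(i * w for i, w in zip(inputs, weights)) + bias * bias_weight
    return sigmoid(net)


def is_active(output, threshold=0.5):
    """The neuron counts as activated when its output exceeds the threshold."""
    return output > threshold


# Four inputs and four weights, as in the single-neuron figure (values are made up).
out = neuron_output([0.5, 0.1, 0.9, 0.3], [0.4, -0.2, 0.7, 0.1], bias_weight=-0.5)
print(out, is_active(out))
```

Changing `bias_weight` shifts the net input before the sigmoid is applied, which has the same effect as moving the threshold.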



To get meaningful results from our ANN, we have to train it. Training is arguably the most difficult part of implementing a neural network. There are different kinds of training algorithms for neural networks, which can be categorised as supervised or unsupervised training. The most commonly used training algorithm is back-propagation, so I would like to explain the method briefly.



Training an ANN is actually an optimisation process. To start training, we assign random weights to the connections. After this initialisation, we apply our training input to the network and calculate the output it generates. The next step is to compare the generated output with the output we want to obtain. The difference between these two values lets us calculate an error value. The most commonly used error evaluation formula is the mean-square error, which can be calculated with the formula below (AI for Game Developers):





$$\varepsilon = \frac{\sum (n_c - n_d)^2}{m}$$



Here $n_c$ is the calculated value, $n_d$ is the desired value and m is the number of output neurons for each epoch.

According to the error value, we adjust the weights in order to obtain the desired output. This adjustment can be based on different techniques or formulas. We will use the formula below to adjust the weights:





$$\Delta w = \rho\,\delta_i\,n_i$$



Here $\rho$ is the learning rate, $\delta_i$ is the error value we obtained and $n_i$ is the value of the neuron being considered. The new weight is obtained simply by adding $\Delta w$ to the old weight.
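The two formulas above can be written out directly. The following Python sketch uses hypothetical function names and made-up numbers; it shows a single error evaluation and a single weight update, not a full back-propagation pass.

```python
def mean_square_error(calculated, desired):
    """Mean-square error over the m output neurons."""
    m = len(calculated)
    return sum((nc - nd) ** 2 for nc, nd in zip(calculated, desired)) / m


def updated_weight(weight, learning_rate, delta, neuron_value):
    """Add the adjustment (learning rate * error * neuron value) to the old weight."""
    return weight + learning_rate * delta * neuron_value


# Two output neurons: calculated outputs against desired outputs.
err = mean_square_error([0.8, 0.2], [1.0, 0.0])
new_w = updated_weight(0.5, learning_rate=0.1, delta=0.2, neuron_value=0.9)
print(err, new_w)
```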



Although back-propagation is a well-known method that is widely used in many applications, it has a few drawbacks concerning local and global minima, learning topologies, and generalisation and overfitting. (Soft Computing, Integrating Evolutionary, Neural, and Fuzzy Systems, Chapter 2, pp. 63-65)

• Local & Global Minima

Because back-propagation is a gradient-descent search algorithm, it finds the closest local minimum with respect to the starting point of the search. (Soft Computing, Integrating Evolutionary, Neural, and Fuzzy Systems, Chapter 2, p. 64) Since the search space is multi-dimensional and has many local minima, the back-propagation algorithm will most probably get stuck in a local minimum, as shown below.









Although there are techniques for overcoming this problem, such as adding momentum (AI for Game Developers, Chapter 14, p. 288), there is no perfect solution.

• Learning Topologies

Back-propagation and similar algorithms adjust weights for fixed feed-forward networks. However, the number of units and their interconnection clearly play an important role, and a lack of knowledge about the appropriate topology for a problem, including the number of layers and the number of neurons per layer, often results in slow learning and poor performance. (Soft Computing, Integrating Evolutionary, Neural, and Fuzzy Systems, Chapter 2, pp. 64-65)

• Generalization and overfitting

If our network adapts to the training data too well, the learning error will be very small, but the test error will be high, as the network will not be able to approximate the given function well enough to produce the desired output. In other words, the network starts to memorise the inputs and outputs instead of approximating the underlying function. (Soft Computing, Integrating Evolutionary, Neural, and Fuzzy Systems, Chapter 2, pp. 65-66)









Because of the problems discussed above, evolutionary algorithms have started to be used to overcome them, and successful results have been obtained. Genetic algorithms and genetic-algorithm-based network training and weight optimisation are discussed in the following section.



So how good are multi-layered feed-forward networks? To answer this question, we have to identify the parameters on which their performance depends:

• The learning algorithm and the number of iterations
• The number of learning samples, which determines how well the training data represent the actual function
• The number of hidden units



The number of learning samples is one of the important parameters that determine how good the network is. (An Introduction to Neural Networks, Ben Kröse, Patrick van der Smagt, November 1996) Kröse and van der Smagt use a sample function y = f(x) to test the neural network, with two different numbers of learning samples to compare the results: first 4 learning samples and then 20. With 4 learning samples, the error E_learning is small but E_test is large, since the learning samples are not sufficient to approximate the function. With 20 learning samples, the test error decreases while the learning error increases. The graphs below show these two cases and how well the neural network approximates the function.









The dashed line gives the desired function, the learning samples are depicted as circles and

the approximation by the network is shown by the drawn line. A) 4 learning samples B) 20

learning samples.



In another test, the same function is used with different numbers of hidden units. The network is first tested with 5 hidden units and then with 20 hidden units.

With 20 hidden units, the effect called overtraining can be seen easily in the results. The network fits the learning samples exactly, but because of the large number of hidden units, the function actually represented by the network is far wilder than the original one.









The dashed line gives the desired function, the circles denote the learning samples and the drawn line gives the approximation by the network. 12 learning samples are used. A) 5 hidden units B) 20 hidden units

“The example shows that a large number of hidden units leads to small error on the training

set but not necessarily leads to a small error on the test set. Adding hidden units will always

lead to a reduction of the learning error rate. However, adding hidden units will first lead to

a reduction of the test error rate but then lead to an increase of test error rate as shown

below.” (An Introduction to Neural Networks, Ben Kröse, Patrick van der Smagt, November 1996)









2.2 Genetic Algorithms



Genetic algorithms can be used for any problem that involves an optimisation process. Possible solutions are encoded as individuals, which evolve over time by adapting to their environment; the ones that do not fit are eliminated.



An evolutionary algorithm maintains a population of candidate solutions for the problem at

hand, and makes it evolve by iteratively applying (usually quite small) set of stochastic

operators, known as mutation, recombination (crossover) and selection. (Soft Computing,

Integrating Evolutionary, Neural, and Fuzzy systems, Chapter:1 p:,2)



A sample evolution process is shown below.

[Figure: evolution cycle with the stages First Generation, Rank Fitness, Selection and Evolution. Taken from AI for Game Developers.]


Representation



Since I intend to use genetic algorithms to optimise only the weights of a neural network, I will not cover how to encode the ANN architecture and topology into a chromosome. Instead, I will briefly discuss strategies for encoding weights.



For a weight-based representation, the first basic decision is whether to use a bit string representation or a real value representation.



One approach to representing the connection weights is to use bit strings. In this approach each weight value is represented as a binary string and then decoded into a real value in a range such as [-10, 10] or [-1, 1]. This type of representation has some drawbacks. As the number of connections increases, the binary strings grow to a length that slows down the evolution process. (Evolutionary_algorithms_for_neural_network_design_and_training.pdf; Soft Computing, Integrating Evolutionary, Neural, and Fuzzy Systems, Chapter 4, Section 4-2, p. 125)





[Figure: a network fragment with inputs I0 and I1 and connection weights 4, 6, 5 and -2, encoded as a bit string.]

0 0 0 0 0 1 0 0 | 0 0 0 0 0 1 1 0 | 0 0 0 0 0 1 0 1 | 1 0 0 0 0 0 1 0
        4       |        6        |        5        |       -2





In this sample, each weight value is represented with 8 bits, and the first bit is used as a sign bit. With the remaining 7 bits, we can encode weights between -127 and 127. It is suggested in (Combining Genetic Algorithms and Neural Networks: The Encoding Problem) that one extra bit can be used to encode whether the connection exists at all.
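The 8-bit encoding in this sample (one sign bit followed by seven magnitude bits) can be sketched in Python as follows. The function names are hypothetical, and the sketch covers integer weights only, without the optional extra connection bit.

```python
def encode_weight(value, bits=8):
    """Sign-magnitude encoding: the first bit is the sign, the rest the magnitude."""
    limit = 2 ** (bits - 1) - 1  # 127 for 8 bits
    if not -limit <= value <= limit:
        raise ValueError("weight does not fit into the chosen number of bits")
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), "0{}b".format(bits - 1))


def decode_weight(bit_string):
    """Inverse of encode_weight."""
    magnitude = int(bit_string[1:], 2)
    return -magnitude if bit_string[0] == "1" else magnitude


def encode_chromosome(weights, bits=8):
    """Concatenate the encoded weights into one bit string."""
    return "".join(encode_weight(w, bits) for w in weights)


def decode_chromosome(chromosome, bits=8):
    """Split the bit string into fixed-size genes and decode each one."""
    return [decode_weight(chromosome[i:i + bits])
            for i in range(0, len(chromosome), bits)]


chromosome = encode_chromosome([4, 6, 5, -2])
print(chromosome)
print(decode_chromosome(chromosome))
```

For the weights 4, 6, 5 and -2 this produces the same 32-bit string as the sample above.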



Another approach to encode weights is to use real value representation. Instead of bit

strings, weights are coded directly as real values. Most of the current genetic algorithm

weight optimization methods use real value representation.

(Evolutionary_algorithms_for_neural_network_design_and_training.pdf)



For the same network illustrated above, the real value representation is shown below.

[Figure: the same network fragment (inputs I0 and I1) with connection weights 4.2, 6.1, 5.4 and -2.0.]

Chromosome: 4.2 6.1 5.4 -2.0





Real values can be implemented as integers or floats, and they are usually kept within a certain interval. When the first population is initialised, a predefined interval is usually chosen and the chromosome values are generated randomly within it. Weight values are usually assigned in the interval [-10, 10] or [-20, 20]. (Neural network weight training by mutation; Using a Genetic Algorithm for Optimization of Synaptic Weights of Neural Networks)



The other important question to be answered is the order in which the weights should be encoded in the chromosome. With the crossover operator (which will be discussed in detail in the following sections), genes that are far from each other are more likely to be separated than genes that are close together. See the following figure.









Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 Gene9 Gene10





Cut point





As you can see, a cut point is more likely to separate gene2 from gene9 than gene2 from gene3. Because of this, it is recommended to place functionally related genes close together. In other words, it is recommended to place all the incoming and outgoing weights of a hidden node next to each other. (Evolutionary_algorithms_for_neural_network_design_and_training.pdf)



The choice of encoding model depends strongly on the problem we are trying to optimise. However, we should keep in mind that both binary string and real value encodings have drawbacks. One drawback of bit strings is that, as the network structure gets more complex, the chromosomes become very long, which slows down the evolution process as mentioned above. The drawbacks of real value encoding are also of the utmost importance.



One problem with the real value implementation is that mutation is the only way to change weight values. Exchanging values between two chromosomes only produces another combination of the parents' weights. With the bit string representation, in contrast, exchanging bits between chromosomes can cut through a weight and so produce new weight values. This is one of the weaknesses of the real value implementation. (Training of Neural Networks by means of Genetic Algorithms Working on very long Chromosomes)










Mutation



The mutation operator simply changes the values of some genes in a chromosome at random. For bit string encoded chromosomes, mutation occurs by flipping some random bits. The mutation rate determines the probability of mutation in a population. For example, a mutation rate of 2% means that, on average, two individuals in a population of size 100 will be mutated. The diagram below shows the effect of mutation on a bit string chromosome.



Chromosome before mutation: 0 0 1 0 1

Chromosome after mutation: 0 1 1 0 0



Mutation is a bit different when the real value representation is used. The mutation operator adds or subtracts an integer, shifting the weight value by a small amount. The value to be added or subtracted is chosen from the interval (1...α), where α is the amount of mutation.
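Both mutation variants can be sketched in Python as below. The function names are hypothetical, and applying the mutation rate to each gene independently is an assumption of this sketch; the text above describes the rate at the level of the population.

```python
import random


def mutate_bits(chromosome, rate, rng=random):
    """Bit string mutation: flip each bit with probability `rate` (assumed per gene)."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def mutate_real(weights, rate, alpha, rng=random):
    """Real value mutation: add or subtract an integer chosen from 1..alpha."""
    mutated = []
    for w in weights:
        if rng.random() < rate:
            step = rng.randint(1, alpha)
            w = w + step if rng.random() < 0.5 else w - step
        mutated.append(w)
    return mutated


rng = random.Random(42)  # fixed seed so that runs are repeatable
print(mutate_bits([0, 0, 1, 0, 1], rate=0.4, rng=rng))
print(mutate_real([4.2, 6.1, 5.4, -2.0], rate=0.5, alpha=3, rng=rng))
```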



Crossover



The crossover operator creates a new individual by selecting genes from the parents at random. There are different kinds of crossover implementations, but they all rely on the same principle. First, let us see how the crossover operator works on bit string chromosomes.









Parent 1: 0 0 1 1 0

Parent 2: 1 1 0 0 0

Offspring: 0 0 1 0 0


The figure above shows one-point crossover. Two-point and three-point crossover operators exist as well. Through crossover, the parents exchange genetic material and create a new individual as a result.
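One-point crossover can be sketched in Python as follows. The function name and the optional `cut` parameter are hypothetical; when no cut point is given, one is chosen at random so that each parent contributes at least one gene.

```python
import random


def one_point_crossover(parent1, parent2, cut=None, rng=random):
    """Take the genes before the cut point from parent1 and the rest from parent2."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    if cut is None:
        cut = rng.randint(1, len(parent1) - 1)
    return parent1[:cut] + parent2[cut:]


child = one_point_crossover([0, 0, 1, 1, 0], [1, 1, 0, 0, 0], cut=3)
print(child)
```

With the two parents from the figure and a cut after the third gene, the result is the offspring 0 0 1 0 0 shown there.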



Fitness Function



Selecting the fitness function is one of the most important decisions in a genetic algorithm, since it determines how good a solution is for the specified problem. In addition, whether a chromosome transfers its genetic material to the next population is decided by the chromosome's fitness value.



Selection



Briefly, selection means that the fittest individuals survive. Selection is based on the fitness values of the chromosomes. A chromosome with a larger fitness value than the others is more likely to survive into the next generation. Several selection methods exist, such as tournament-based selection and roulette-wheel selection.



Tournament-based selection simply picks two random individuals from the population, keeps the one with the larger fitness value and discards the other.
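A two-individual tournament as described above might look like this in Python. The function name and the toy fitness function are assumptions made for illustration, and a larger fitness value is treated as better, as in the description.

```python
import random


def tournament_select(population, fitness, rng=random):
    """Pick two distinct individuals at random and keep the fitter one."""
    first, second = rng.sample(population, 2)
    return first if fitness(first) >= fitness(second) else second


# Toy example: individuals are bit lists and fitness is the number of ones.
population = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
winner = tournament_select(population, fitness=sum, rng=random.Random(7))
print(winner)
```

Because the two contestants are chosen at random, a weak individual can still win a tournament against an even weaker one, which is how some weak genes reach the next population.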



Roulette-wheel selection xxx



Putting all the steps together, we can summarise the process with the following flowchart. (http://files.myopera.com/blu3c4t/blog/a_basic_genetic_algorithm.jpg)


Lastly, what are the termination criteria when evolving the population? Both the best fitness value, which depends on the error rate, and the number of generations matter. Sometimes evolution terminates when a certain error value is reached, sometimes when the maximum number of generations is reached, and sometimes both criteria are used together. The choice depends entirely on the problem domain.



Evolutionary Algorithms on ANN Training



Evolutionary algorithms have been used for years, and successful results have been obtained. (xxx)



The first step is to create the initial population. Each chromosome consists of as many weights as there are connections in the ANN, and represents one possible set of weights for the ANN.



When a neural network is trained with a genetic algorithm, the fitness values are calculated from the error of the ANN for a specific chromosome. The pseudocode is shown below.



for each chromosome in population
do
    load weights into ANN
    for each data set in training samples
    do
        load inputs into ANN
        feed the inputs forward
        calculate output
        compare actual and desired output
        accumulate the error value
    end
end







After all the learning samples have been processed, an error value is obtained for the chromosome. We can use this error value as the fitness value of the chromosome. In this case, the smaller the fitness value, the more successful the chromosome is at solving the problem.
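The evaluation loop in the pseudocode can be sketched in Python as below. This is a serial, illustrative version only: the function names, the flat chromosome layout (each neuron's input weights followed by its bias weight), the bias value of 1 and the use of the mean-square error are assumptions of the sketch rather than details of the project's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feed_forward(chromosome, inputs, layer_sizes):
    """Run the inputs through a fully connected network whose weights are read
    in order from the flat chromosome (input weights, then one bias weight)."""
    position = 0
    activations = list(inputs)
    for size in layer_sizes[1:]:
        next_activations = []
        for _ in range(size):
            n_in = len(activations)
            weights = chromosome[position:position + n_in]
            bias_weight = chromosome[position + n_in]
            position += n_in + 1
            net = sum(a * w for a, w in zip(activations, weights)) + 1.0 * bias_weight
            next_activations.append(sigmoid(net))
        activations = next_activations
    return activations


def chromosome_error(chromosome, samples, layer_sizes):
    """Average mean-square error of one chromosome over all training samples."""
    total = 0.0
    for inputs, desired in samples:
        actual = feed_forward(chromosome, inputs, layer_sizes)
        total += sum((a - d) ** 2 for a, d in zip(actual, desired)) / len(desired)
    return total / len(samples)


def evaluate_population(population, samples, layer_sizes):
    """One error (fitness) value per chromosome; smaller is better."""
    return [chromosome_error(c, samples, layer_sizes) for c in population]


# A 2-2-1 network needs 2 * (2 + 1) + 1 * (2 + 1) = 9 weights per chromosome.
xor_samples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
population = [[0.0] * 9, [0.5, -0.5, 0.1, 0.3, 0.2, -0.1, 1.0, -1.0, 0.0]]
print(evaluate_population(population, xor_samples, [2, 2, 1]))
```

Each chromosome is evaluated independently of the others, which is what makes this loop a natural candidate for parallel execution.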



2.3 Differential Evolution Method



Differential evolution is another kind of evolutionary programming method with some

differences in mutation and crossover operator.



In the differential evolution method, new individuals are generated by combining other chromosomes from the population. As mentioned in previous sections, the usual mutation operator changes some random bits or adds or subtracts some value. In differential evolution, however, two chromosome vectors are selected at random, their difference is calculated and multiplied by a mutation constant, and the result is added to another randomly selected chromosome.



Let w represent a chromosome, g the generation and k the index of an individual in the population. The mutant vector v can then be generated according to one of the following rules:

$$v_{g+1}^{k} = w_{g}^{r_1} + \mu\,(w_{g}^{r_2} - w_{g}^{r_3})$$

$$v_{g+1}^{k} = w_{g}^{best} + \mu\,(w_{g}^{r_1} - w_{g}^{r_2})$$

$$v_{g+1}^{k} = w_{g}^{k} + \mu\,(w_{g}^{best} - w_{g}^{k}) + \mu\,(w_{g}^{r_1} - w_{g}^{r_2})$$

where $w_{g}^{best}$ is the best individual of the g-th generation, and r1, r2 and r3 are mutually different and different from the running index k. As you can see, differential evolution combines other individuals to generate a trial chromosome. µ > 0 is a real parameter called the mutation constant, which controls the amplification of the difference between two individuals.



After the mutant individual has been generated, the crossover (or recombination) operator is applied to it. To decide which gene is transferred to the new individual, a random number in [0, 1] is generated for each gene. If the random number is less than the crossover rate, the gene is taken from the trial (mutant) vector; otherwise it is taken from the target vector. This process is illustrated below.



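[Figure: DE evolution. Population size = 10 (Ch1 to Ch10). For chromosome 2, a trial vector is formed from other chromosomes of the population as one chromosome plus µ times the difference of two others. For each gene, a random number in [0, 1] then decides whether the result vector takes the gene from the trial vector or from chromosome 2.]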



After the result vector has been generated, it is selected if its fitness value is less than that of the original chromosome. Otherwise, the original is kept.

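One generation of differential evolution, as described in this section, can be sketched in Python. The sketch uses the variant in which the scaled difference of two random chromosomes is added to a third random chromosome. The function name, the default values of `mu` and `crossover_rate`, and the toy error function are assumptions for illustration.

```python
import random


def de_generation(population, error, mu=0.5, crossover_rate=0.7, rng=random):
    """Return the next generation; `error` maps a chromosome to a value to minimise.
    The population must contain at least four chromosomes."""
    next_population = []
    for k, target in enumerate(population):
        # Three mutually different chromosomes, all different from the target.
        others = [i for i in range(len(population)) if i != k]
        r1, r2, r3 = rng.sample(others, 3)
        a, b, c = population[r1], population[r2], population[r3]
        # Mutation: add the scaled difference of two chromosomes to a third one.
        mutant = [a[j] + mu * (b[j] - c[j]) for j in range(len(target))]
        # Crossover: each gene comes from the mutant or from the target.
        result = [mutant[j] if rng.random() < crossover_rate else target[j]
                  for j in range(len(target))]
        # Selection: keep the result only if its error is smaller.
        next_population.append(result if error(result) < error(target) else target)
    return next_population


def sphere(weights):
    """Toy error function with its minimum at the origin."""
    return sum(w * w for w in weights)


rng = random.Random(1)
population = [[rng.uniform(-10, 10) for _ in range(4)] for _ in range(10)]
for _ in range(50):
    population = de_generation(population, sphere, rng=rng)
print(min(sphere(p) for p in population))
```

Because a chromosome is only ever replaced by a result vector with a smaller error, the error of each position in the population never increases from one generation to the next.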

2.4 CUDA



INTRODUCTION



CUDA (Compute Unified Device Architecture) is both a hardware and a software model that allows us to run parallel code on the GPU without mapping it to a graphics API. CUDA scales programs to hundreds of cores and thousands of threads running simultaneously. It allows the programmer to focus on the parallel algorithm instead of the mechanics of a parallel programming language or graphics API.



The part of the program that runs in parallel on the GPU is called a “kernel” in CUDA terminology. One kernel is executed on the GPU at a time, and each kernel is executed by many (thousands of) threads. There are important differences between CPU threads and GPU threads:

• CUDA threads are extremely lightweight, which means very little creation overhead and instant switching.
• CUDA uses thousands of threads to run efficiently, whereas a multi-core CPU can use only a few.







In CUDA terminology, the CPU is called the “host” and the GPU is called the “device”.



THREAD HIERARCHY



CUDA also provides a thread hierarchy to make mapping a kernel to the GPU easier. The thread hierarchy also defines which threads can reach which memory locations, and it is used for thread synchronisation.

Each thread runs the same kernel issued by the host and has its own local memory and registers.









[Figure: a single thread with its own local memory and registers]









Threads are batched into thread blocks. Each block can consist of up to 512 threads, and the
threads in a block can be organized in 1D, 2D or 3D. All threads in the same thread block have
access to that block's shared memory and can cooperate through it.







[Figure: a thread block and its shared memory]

Threads within a block can be synchronized; however, threads in different blocks cannot be
synchronized with each other.



Each thread is given a unique thread id when the kernel is launched. This thread id is a built-in
variable and can be accessed through its x, y and z components, as threads can be organized in
1D, 2D or 3D.







Thread Block

(0, 0) (1, 0) (2, 0) (3, 0)
(0, 1) (1, 1) (2, 1) (3, 1)
(0, 2) (1, 2) (2, 2) (3, 2)
(0, 3) (1, 3) (2, 3) (3, 3)





As you can see, threads in a 2D block are identified by their x and y components, which can be
accessed via the built-in variable threadIdx.



Thread blocks are required to execute independently since they can be executed in any

order, in parallel or in serial. This independence requirement allows thread blocks to be

scheduled in any order across any number of cores.



Thread blocks are grouped into a “grid”, and the blocks of a grid can be organized in 1D or 2D.

[Figure: a grid of thread blocks, each block with its own shared memory]

Each block is given a unique block id, which can be accessed via the built-in variable blockIdx.
A grid can consist of up to 2^32 blocks which, with up to 2^9 = 512 threads per block, results in
a total of up to 2^41 threads.



The block size can be read from the built-in variable blockDim. It is a 3D variable, and the size
of each dimension is given by its x, y or z component.

There is no way to synchronise thread blocks; only threads in the same block can be synchronized.
Different blocks also cannot use each other's shared memory. Every thread has access to global
memory at grid level.









[Figure: global memory shared by the grids launched for Kernel 1 and Kernel 2]



The software model of CUDA is tightly coupled with its hardware model. So, in order to understand
parts of the software model such as the thread hierarchy or the memory hierarchy, it is better to
have a look at the hardware model of CUDA first.

ARCHITECTURE AND MEMORY HIERARCHY

Let us look at the hardware architecture of the NVIDIA G80 series GPUs, which are capable of
running CUDA.



[Figure: a multiprocessor with eight thread processors (TP) and its shared memory.
TP = Thread Processor]



As you can see, G80 series GPUs have 16 multiprocessors. Each multiprocessor has 8 thread
processors, which gives 16 x 8 = 128 thread processors in total, and every thread processor can
access the shared memory of its own multiprocessor.



A grid is executed on the device by scheduling thread blocks onto the multiprocessors. Each block
is mapped to exactly one multiprocessor. Multiple thread blocks can be mapped onto the same
multiprocessor and are then executed concurrently. If multiple blocks are mapped onto a
multiprocessor, its resources, such as registers and shared memory, are split among the mapped
thread blocks. This limits the number of thread blocks that can be mapped onto the same
multiprocessor. The ratio of active threads on a multiprocessor to the maximum number of threads
supported by a multiprocessor is called occupancy.
(Cuda_data_structure_nice.pdf20/78)
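To make the resource split concrete, the sketch below estimates how many blocks fit on one multiprocessor and the resulting occupancy. It is a simplified model, not a replacement for NVIDIA's occupancy calculator: the limits in `SmLimits` (768 threads, 8 blocks, 8192 registers and 16 KB of shared memory per multiprocessor) are the figures commonly quoted for G80-class devices and are assumptions here, and allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. The defaults are assumed G80-class values.
struct SmLimits {
    int maxThreads  = 768;
    int maxBlocks   = 8;
    int registers   = 8192;
    int sharedBytes = 16 * 1024;
};

// Number of blocks that can be resident on one multiprocessor at a time.
int residentBlocks(const SmLimits& sm, int threadsPerBlock,
                   int registersPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && registersPerThread > 0);
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs    = sm.registers / (registersPerThread * threadsPerBlock);
    int byShared  = sharedBytesPerBlock > 0
                        ? sm.sharedBytes / sharedBytesPerBlock
                        : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}

// Occupancy: active threads divided by the maximum supported threads.
double occupancy(const SmLimits& sm, int threadsPerBlock,
                 int registersPerThread, int sharedBytesPerBlock) {
    int blocks = residentBlocks(sm, threadsPerBlock, registersPerThread,
                                sharedBytesPerBlock);
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

Under these assumed limits, a kernel with 256 threads per block and 10 registers per thread fits three blocks per multiprocessor, whereas 16 registers per thread reduces this to two blocks because the register file becomes the limiting resource.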



CUDA devices use several memory spaces, each of which has its own characteristics that reflect
its usage in CUDA applications. These memory spaces are global, constant, shared, texture, local
and register memory, as shown below.









[Figure: host (CPU, chipset, DRAM) and device (GPU, DRAM). Each multiprocessor on the GPU holds
registers and shared memory; local, global, constant and texture memory are located in device
DRAM.]









As you can see, the host can read and write global memory, texture memory and constant memory;
however, constant memory and texture memory are read-only memory spaces for the GPU. Within the
scope of this project, only global memory is used to transfer data to the GPU.

The global, texture and local memory spaces reside in device DRAM and therefore have the greatest
access latency (although texture memory is cached), followed by constant memory, registers and
shared memory.



The principal traits of the various memory types are shown below:

Memory     Location        Cached   Access   Scope                  Lifetime
           (on/off chip)
Register   On              n/a      R/W      1 thread               Thread
Local      Off             No       R/W      1 thread               Thread
Shared     On              n/a      R/W      All threads in block   Block
Global     Off             No       R/W      All threads + host     Host allocation
Constant   Off             Yes      R        All threads + host     Host allocation
Texture    Off             Yes      R        All threads + host     Host allocation





SOFTWARE MODEL



CUDA programs can be written using either C for CUDA or the CUDA Driver API. C for CUDA was
selected for this project since it has C-style syntax and is easier to learn.

C for CUDA extends C and allows the programmer to define functions, called “kernels”, which,
unlike traditional C functions, are executed N times in parallel by N different CUDA threads.

A kernel is defined with the __global__ declaration specifier, which means it will be called by
the host. A kernel is called from host code by its name, followed by the <<<...>>> syntax, which
provides attributes such as grid size, block size and shared memory size per block, and then by
its parameters. In addition, kernel functions must have a “void” return type. A sample
declaration and the host code are shown below:







__global__ void vectorAdd(float* vec1, float* vec2, float* result)
{
    //add operation
}

int main()
{
    //...
    // A, B and C are device pointers; N is the number of threads
    vectorAdd<<<1, N>>>(A, B, C);
}







This launches the kernel with just one block in the grid and N threads in the block.

As mentioned in previous sections, each thread is given a unique thread id which can be accessed
via the built-in threadIdx variable in device code. threadIdx is a built-in three-component
vector (type uint3, on which dim3 is based) with x, y and z components. Just for illustration,
we can improve the code shown above with the threadIdx variable:

__global__ void vectorAdd(float* vec1, float* vec2, float* result)
{
    //add operation: each thread handles the element at its own index
    int thread_id = threadIdx.x;
    result[thread_id] = vec1[thread_id] + vec2[thread_id];
}

int main()
{
    //...
    vectorAdd<<<1, N>>>(A, B, C);
}









This kernel is executed by several threads with different thread ids. Each thread uses its thread
id as an index, accesses the array elements at that index and performs the addition. The diagram
below shows how this addition is carried out in parallel.







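[Figure: parallel vector addition. Each thread reads element thread_id (0 to 3) of the two input
arrays, adds the two values and writes the sum to element thread_id of the result array.]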









Most kernels use unique thread ids in order to access data in parallel, so it is important to
identify each thread correctly. To do this, we use built-in variables such as blockDim, blockIdx
and threadIdx. Physically, device memory has a linear, one-dimensional layout, just like main
memory on the host. 2D and 3D thread ids are therefore purely virtual and only make the mapping
easier for the programmer. That is why thread ids have to be converted to an id that is unique
within the grid. The most common pattern for computing the unique thread id is
blockIdx.x * blockDim.x + threadIdx.x

Grid (blockDim.x = 5)

blockIdx.x:        0           1             2
threadIdx.x:   0 1 2 3 4   0 1 2 3 4     0  1  2  3  4
unique id:     0 1 2 3 4   5 6 7 8 9    10 11 12 13 14

unique id = blockIdx.x * blockDim.x + threadIdx.x
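The mapping can be checked on the host without a GPU. The sketch below is an illustration only: `Dim1` is a hypothetical stand-in for the x component of the CUDA built-in variables, and `enumerateGrid` simply walks over all blocks and threads of a 1D grid in order.

```cpp
#include <cassert>
#include <vector>

// Host-side stand-in for the x component of blockIdx, blockDim and threadIdx.
struct Dim1 { int x; };

// Same formula as in device code: blockIdx.x * blockDim.x + threadIdx.x
int globalThreadId(Dim1 blockIdx, Dim1 blockDim, Dim1 threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Visits every (block, thread) pair of a 1D grid and collects the unique ids.
std::vector<int> enumerateGrid(int gridDimX, int blockDimX) {
    assert(gridDimX > 0 && blockDimX > 0);
    std::vector<int> ids;
    for (int b = 0; b < gridDimX; ++b)
        for (int t = 0; t < blockDimX; ++t)
            ids.push_back(globalThreadId({b}, {blockDimX}, {t}));
    return ids;
}
```

For a grid of 3 blocks with 5 threads each, the ids come out as 0 to 14 with no gaps and no duplicates, which is what makes the formula safe to use as an array index.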









Threads within the same block can cooperate by sharing data through shared memory and by
synchronizing their execution to coordinate memory accesses. Synchronization among threads in the
same block is done via the __syncthreads() function provided by CUDA. This function acts as a
barrier at which all threads in the block must wait before any of them is allowed to proceed.

As previously mentioned, the host and the device have separate memory spaces, and data is
transferred from host to device via the PCI Express bus. The host manages device memory, so CUDA
provides C-like functions to allocate, free and copy device memory. The important point to keep
in mind is that pointers to memory locations are just numbers, so we cannot tell from a pointer
value whether it refers to a memory location on the CPU or on the GPU. Dereferencing a device
pointer on the host will most probably crash the program, and vice versa.



The functions provided by CUDA to manage device memory are easy to use, as they resemble the C
memory functions. There are three important memory management functions which are frequently
used while developing a CUDA application:

cudaMalloc(void** pointer, size_t nBytes);

cudaFree(void* pointer);

cudaMemcpy(void* dest, const void* src, size_t nBytes, enum cudaMemcpyKind direction);

The enum cudaMemcpyKind can be any of the following:

• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice







As you can see, the CUDA interface provides memory copy operations from host to device, from
device to host and from device to device. When using these functions, the programmer must be
careful about which pointer is a device pointer and which one is a host pointer. To make
programming easier, a naming convention can be used when declaring variables: an “h_” prefix for
host-side variables and a “d_” prefix for device-side variables. This notation is used throughout
the thesis to make pointers easier to manage and to prevent memory-related crashes.







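// Device code (N is the vector length, assumed to be a compile-time constant)
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Host code
int main()
{
    size_t size = N * sizeof(float);
    // h_A, h_B and h_C are host vectors allocated with malloc (not shown)
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}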







The code above demonstrates the basic memory operations: allocate, copy and free. First, memory
is allocated on the host side with malloc, then device memory is allocated with the CUDA function
cudaMalloc, and the vectors are copied from host to device. Next, the kernel is launched and the
result is copied back to the host. Finally, the device pointers are freed. These steps form the
basic and most common pattern for memory operations in a CUDA application.
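When the number of elements is not a multiple of the block size, the grid has to be rounded up to the next whole block, and the surplus threads of the last block must not touch memory. The sketch below illustrates both points on the host; the function names `blocksPerGrid` and `threadIsActive` are hypothetical and chosen for this example only.

```cpp
#include <cassert>

// Smallest number of blocks whose threads cover n elements (rounds up).
int blocksPerGrid(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Mirrors a kernel's bounds check: a thread works only if its unique
// index falls inside the array.
bool threadIsActive(int blockIdxX, int blockDimX, int threadIdxX, int n) {
    int i = blockDimX * blockIdxX + threadIdxX;
    return i < n;
}
```

For 1000 elements and 256 threads per block this gives 4 blocks, that is 1024 threads, of which the last 24 fail the bounds check and do nothing.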



Until now, we have talked about allocating single variables and one-dimensional arrays. What
happens if we need 2D or 3D arrays on the device? In this case, CUDA recommends using
cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocating 2D or 3D
arrays because they make sure that the allocation is appropriately padded to meet the alignment
requirements. The returned pitch must be used to access array elements. The following code sample
allocates a width x height 2D array of floating-point values and shows how to loop over the array
elements in device code:







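// Host code
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<grid, block>>>(devPtr, pitch, width, height);

// Device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is given in bytes, so rows are stepped through a char pointer
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

A kernel is launched with an execution configuration of the form
kernel<<<grid, block, shared_memory, stream>>>(arguments), where

• grid: grid dimension, up to 2D (dim3 type)
• block: block dimension, up to 3D (dim3 type)
• shared_memory: number of bytes of dynamically allocated shared memory per block (optional)
• stream: stream ID (optional)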





Until now, we have used several built-in variables defined by CUDA. These built-in variables can
be accessed in __global__ and __device__ functions. They are listed below:

• dim3 gridDim;
  o Dimensions of the grid in blocks (at most 2D)
• dim3 blockDim;
  o Dimensions of the block in threads
• uint3 blockIdx;
  o Block index within the grid
• uint3 threadIdx;
  o Thread index within the block

dim3 is an integer vector type based on uint3, and its components can be accessed as x, y and z.



There are also a couple of built-in variable qualifiers defined by the CUDA framework:

• __device__: a variable declared with this qualifier is stored in global memory, which is
  large, has high latency and is not cached. Such variables, like memory allocated via
  cudaMalloc as described above, are accessible by all threads.
• __shared__: these variables are stored in on-chip shared memory, which has very low latency.
  Their size is specified either at compile time or through the execution configuration, and
  they are accessible by all threads in the same block.

Unqualified variables such as scalars and built-in vector types are stored in registers; if they
do not fit in registers, they are spilled to local memory.








3. Methodology



I have implemented the project in two main parts. One is the serial implementation, on the CPU,
of a neural network trained by genetic algorithms; the other is the parallel implementation on
the GPU using CUDA.



IMPLEMENTING NEURAL NETWORK ON CPU



The first design decision for the CPU code was to implement it at as low a level as possible,
because there is no need to implement hundreds of classes and to encapsulate everything if the
problem can be solved more simply and with less code. In addition, it is easier to port low-level
code to the GPU than an object-oriented design.

There are two further reasons for doing both a serial and a parallel implementation. The first is
to understand the basics of neural networks and genetic algorithms before implementing them in
parallel. The second is to compare the performance of the parallel and serial implementations and
to show how fast the parallel implementation is. My proposition is that the parallel
implementation on CUDA is much faster than the serial implementation, which means that most of
the AI workload on the CPU can be shifted to the GPU, so that AI in games runs faster on the GPU
and the CPU is left free for other tasks.

An ANN is composed of nodes and the connections between them. Nodes can be represented as a 2D
array in which a node is identified by its layer number and its node number within that layer.
Similarly, connections can be represented as a 3D array in which a connection is identified by a
layer, a neuron in that layer and a neuron in the previous layer.



These arrays could be defined as static arrays with a fixed size. However, that would waste
memory, as the layers do not all have the same number of neurons; moreover, the number of layers
and the number of neurons per layer should be configurable from an external source, so these
numbers are not static. Considering all of this, the best solution is to implement nodes and
connections with pointers. That is why “**” (pointer-to-pointer) and “***”
(pointer-to-pointer-to-pointer) are used to define the neural network. The diagrams below show
how the neural network is represented in memory using pointers.

This diagram shows how a 3-layer (3-4-2) neural network can be implemented with pointers.
Neurons are simple dynamically allocated 2D arrays in memory, as each neuron can be identified
using two indices: the layer number and the neuron number within that layer.

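[Figure: neurons stored through a pointer-to-pointer (**) X. X points to a pointer array whose
entries indicate the layers; each entry points to a double array whose entries indicate the
neurons of that layer. For example, X[0][0] = neuron a and X[1][1] = neuron b.]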









Weights are represented as 3D arrays, as each weight can be identified using three indices: the
layer number, the neuron in that layer and the neuron in the previous layer.





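[Figure: weights stored through a pointer-to-pointer-to-pointer (***) a. a points to a
pointer-to-pointer array with one ** entry per layer (NULL for the input layer); each entry
points to a pointer array with one * entry per neuron, and for each of these pointers space is
allocated for as many weights as there are neurons in the previous layer. For example,
a[2][0][2] is the weight between the first neuron in layer 3 (the output layer) and neuron 3 in
the previous layer (layer 2).]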



So why would we need lots of classes such as a Neuron class, a Layer class and a NeuralNetwork
class? Neurons are just 2D arrays and weights are just 3D arrays.

Dynamic memory allocation was done according to the following code snippet.



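//allocate memory for weights (neuronsNum: neurons per layer, as in the constructor below)
double*** m_neuronWeights = new double**[m_numberOfLayers];
m_neuronWeights[0] = NULL;  // the input layer has no incoming weights
for (int i = 1; i < m_numberOfLayers; i++) {
    m_neuronWeights[i] = new double*[neuronsNum[i]];            // one pointer per neuron
    for (int j = 0; j < neuronsNum[i]; j++)
        m_neuronWeights[i][j] = new double[neuronsNum[i - 1]];  // one weight per previous neuron
}

[...]

vector<DataPair<double> > dataset =
    DatasetReader::getInstance()->ParseFile<double>("xx.dat",
        INPUT_NUMBER, OUTPUT_NUMBER, TYPE::DOUBLE);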







With getInstance we obtain a reference to the dataset reader, and then we parse the file.
ParseFile is a generic function and can be used with different types. Currently, the double and
int data types are supported, as float can be handled as double. From the function call we can
see that we will read a file containing double values. The first argument of the function names
the file to read, the second is the number of input values per sample, the third is the number
of output values per sample, and the last is an enum value (TYPE::DOUBLE or TYPE::INTEGER)
indicating the type to which the strings read from the file are cast, since this cannot be
determined from the template parameter. That is why the type has to be stated a second time.

The return type of the function is vector<DataPair<T> >. DataPair is a structure defined in the
DatasetReader header file and is composed of one input vector and one output vector.



GENETIC TRAINING BASED ON DIFFERENTIAL EVOLUTION



The most important part of an ANN implementation is the training process. The back-propagation
algorithm is widely used for training; however, due to the drawbacks mentioned in the Literature
Review section, I have decided to use genetic algorithms to optimize the ANN weights. GAs are
also better suited to a parallel implementation.

First, I created the structure of the chromosomes. A chromosome is simply an array of doubles
which holds the weights of the neural network. The representation of a chromosome composed of
weights is shown in the figure below.







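[Figure: a small network with inputs I0 and I1 whose connection weights 4.2, 6.1, 5.4, -2.0, 2.3
and 1.1 are written one after another into a single array]

double*: 4.2 6.1 5.4 -2.0 2.3 1.1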
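The length of such a chromosome follows directly from the network topology. The sketch below is an illustration under stated assumptions: the network is taken to be fully connected, the function names are hypothetical, and the `withBias` flag covers both the figure above (six weights, which matches a 2-2-1 network without bias weights) and a layout that stores one bias weight per neuron.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of weights of a fully connected feed-forward network. With
// withBias set, one extra bias weight is counted per non-input neuron.
std::size_t chromosomeLength(const std::vector<int>& neuronsPerLayer,
                             bool withBias) {
    assert(neuronsPerLayer.size() >= 2);
    std::size_t total = 0;
    for (std::size_t layer = 1; layer < neuronsPerLayer.size(); ++layer) {
        int inputsPerNeuron = neuronsPerLayer[layer - 1] + (withBias ? 1 : 0);
        total += static_cast<std::size_t>(neuronsPerLayer[layer]) * inputsPerNeuron;
    }
    return total;
}

// Start position of one individual when the whole population is stored
// back to back in a single flat array.
std::size_t chromosomeOffset(std::size_t individual, std::size_t length) {
    return individual * length;
}
```

A 1-6-1 network with bias weights needs (1 + 1) * 6 + (6 + 1) * 1 = 19 values per chromosome, so the third individual (index 2) of a flat population array would start at position 38.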









The first step is creating an initial population with random values. The maximum population size
is defined in the Constants header file mentioned previously. It is suggested that the population
size should be 2X [...]

while (g->getMinFitness() > DESIRED_ERROR_RATE &&
       g->mGenerationNumber < MAX_GENERATION_NUMBER) {  // constant name assumed
    g->cycle();
}







As you can see, the desired error rate and the maximum generation number are constants defined
in the constants header file. Their default values are 0.005 and 2000 respectively. You can
change these parameters in order to get different results.

The cycle method is the main function in which the evolution process takes place. This method
executes the mutation, crossover and selection operators for each individual. If the new trial
vector decreases the error rate, it is accepted as the new individual; otherwise it is
discarded. Please refer to Figure DE evolution.

All functionality related to genetic training is encapsulated within the “GeneticTraining”
class. This class also includes an ANN instance for training purposes, so all parameters related
to both the genetic process and the ANN are passed to the constructor. The constructor of the
GeneticTraining class is shown below:



GeneticTraining( int numOfLayers ,

int* neuronsNum ,

int MaxPopulation,

float mutRate ,

float crossRate );







numOfLayers: total number of layers of the ANN
neuronsNum: number of neurons in each layer
MaxPopulation: maximum population size
mutRate: mutation rate used during the mutation process
crossRate: crossover rate used during the crossover process





Default values for all of these parameters are defined in the constants header file. If any of
them is not passed to the constructor, its default value is used.

Now we can describe in detail the steps of the training based on genetic algorithms. The first
step is initializing the population and calculating the fitness values of the initial
population. Calculating the fitness is one of the most expensive operations of the training
process, as it evaluates every individual and, for each of them, runs the ANN on every entry of
the training dataset. So, if we have 1000 individuals and our training data has 5000 samples,
every fitness calculation cycle requires 5000 x 1000 = 5,000,000 network evaluations, which is
very expensive.



To store the fitness values, a double array with the same size as the population is used. So, if
our fitness array is

double* fitness_array = new double[MAX_POPULATION_NUMBER];

then fitness_array[25] stores the fitness value of the individual with index 25. After the first
population has been initialised with random weights, the fitness values of all individuals are
stored and the fittest individual is found.



void GeneticTraining::cycle (){

    Chromosome m;
    Chromosome c;
    Chromosome s;

    for(int i = 0 ; i < [...]

[...]

            [...] k >= 0 ; k--){
                sum += neurons[j - index - (k + 1)] * chr[chromosomeIndex++];
            }
            //add bias
            sum += chr[chromosomeIndex++];

            //sigmoid
            float calc = expf(-1 * sum);
            calc += 1.0f;
            calc = 1.0f / calc;

            //store value
            neurons[j] = calc;

        }
    }
    //load output to output neurons

//load output to output neurons







This is all the code needed to feed data forward through an ANN. As you can see, there are no
complex ANN data structures. We only define a neurons[NEURON_NUM] array to store the neuron
values. For each layer, the weights are multiplied by the neuron values of the previous layer
for each neuron, and the result is stored. Before a layer is processed, the number of neurons in
the previous layers is calculated in order to index the neuron value array. The conversion
between the 1D neuron value array and the real ANN topology is shown below:

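[Figure: mapping between the network topology and the 1D neuron values array]

Layer 0: one neuron (index 0)
Layer 1: three neurons (index 0, index 1, index 2), for example N1,1
Layer 2: one neuron (index 0), N2,0

Array position = number of previous neurons + index
For N1,1: 1 + 1 = 2
For N2,0: 4 + 0 = 4

Neurons values array: 0 1 2 3 4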



In this way we can represent all neuron values in a 1D array and map the topology and the array
onto each other. For instance, for the output neuron, the loop executes three times as there are
three neurons in the previous layer. Their values are fetched from the neurons array and
multiplied by the weights, which are fetched from the chromosome. After that, the bias value is
added (the bias weight is also stored in the chromosome), and finally the sigmoid function
calculates the output, which is stored in the neuron. The neuron values have to be stored
because all neurons except those of the input layer need the values of the neurons in the
previous layer.
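The whole feed-forward pass over a flat neuron array and a flat chromosome can be sketched as follows. This is a host-side illustration rather than the project's device function: the weight order inside the chromosome (for each neuron, one weight per neuron of the previous layer followed by one bias weight) and the use of `std::vector` are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Feeds `input` through a fully connected network described by `layers`
// (neurons per layer) and returns the values of the output layer.
std::vector<float> feedForward(const std::vector<int>& layers,
                               const std::vector<float>& chr,
                               const std::vector<float>& input) {
    assert(layers.size() >= 2);
    assert(input.size() == static_cast<std::size_t>(layers[0]));
    std::size_t totalNeurons = 0, expectedWeights = 0;
    for (std::size_t l = 0; l < layers.size(); ++l) {
        totalNeurons += layers[l];
        if (l > 0)
            expectedWeights += static_cast<std::size_t>(layers[l]) * (layers[l - 1] + 1);
    }
    assert(chr.size() == expectedWeights);

    std::vector<float> neurons(totalNeurons);   // all neuron values in one 1D array
    for (std::size_t i = 0; i < input.size(); ++i) neurons[i] = input[i];

    std::size_t chromosomeIndex = 0;
    std::size_t prevStart = 0;                  // position of the previous layer
    for (std::size_t layer = 1; layer < layers.size(); ++layer) {
        std::size_t start = prevStart + layers[layer - 1];  // previous neurons number
        for (int n = 0; n < layers[layer]; ++n) {
            float sum = 0.0f;
            for (int k = 0; k < layers[layer - 1]; ++k)
                sum += neurons[prevStart + k] * chr[chromosomeIndex++];
            sum += chr[chromosomeIndex++];                         // bias weight
            neurons[start + n] = 1.0f / (1.0f + std::exp(-sum));   // sigmoid
        }
        prevStart = start;
    }
    return std::vector<float>(neurons.end() - layers.back(), neurons.end());
}
```

With all weights set to zero every neuron outputs sigmoid(0) = 0.5 regardless of the input, which is a convenient sanity check for the indexing.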



In this method, the ANN is represented just as the values of these products. This allows us to
run a separate ANN instance in each thread.



Actually, this method can be improved by storing the neuron values of the previous layer only:
when we are calculating the neuron values of the third hidden layer, we no longer need the
values of the first layer, as they have already been used. So we could use an array with the
size of the biggest layer and update that array for each layer calculation, instead of storing
all the neurons of the ANN.
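A possible form of this improvement is sketched below. Note that it uses two buffers of the size of the widest layer rather than a single array, because the values of the previous layer are still needed while the current layer is being computed; this is a swapped-in variant, not what the project implements. The function name and the weight layout (one weight per previous neuron followed by a bias weight) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Feed-forward pass that keeps only the previous and the current layer.
std::vector<float> feedForwardTwoBuffers(const std::vector<int>& layers,
                                         const std::vector<float>& chr,
                                         const std::vector<float>& input) {
    assert(layers.size() >= 2);
    assert(input.size() == static_cast<std::size_t>(layers[0]));
    const int widest = *std::max_element(layers.begin(), layers.end());
    std::vector<float> previous(widest), current(widest);
    std::copy(input.begin(), input.end(), previous.begin());

    std::size_t w = 0;  // running position inside the chromosome
    for (std::size_t layer = 1; layer < layers.size(); ++layer) {
        for (int n = 0; n < layers[layer]; ++n) {
            float sum = 0.0f;
            for (int k = 0; k < layers[layer - 1]; ++k)
                sum += previous[k] * chr.at(w++);   // at() checks the bounds
            sum += chr.at(w++);                      // bias weight
            current[n] = 1.0f / (1.0f + std::exp(-sum));
        }
        std::swap(previous, current);  // the new layer becomes the previous one
    }
    previous.resize(layers.back());
    return previous;
}
```

For a 1-6-1 network this needs 2 x 6 floats of working storage instead of 8, and the saving grows with the number of layers, which matters when every thread carries its own copy.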



Another data structure is the fitness array. It is simply a 1D array with the size of the
population and stores the fitness values of the chromosomes on a one-to-one basis. That is, to
obtain the fitness value of the chromosome with index 5, we fetch the element with index 5 of
the fitness array.

In addition, the same data reader class which is used in the serial implementation is also used
in the CUDA application.



3-Kernel Design



Two kernels are used within the application: one calculates the initial fitness values and the
other trains the ANN.

First, let us look at the kernel responsible for calculating the initial fitness values of the
chromosomes. Before it runs, the chromosome array is created and initialized with random numbers
between -10 and 10. The chromosome array consists of chromosomes, which are simply the weight
values of the ANN. This initialization is done on the host side, and the same number generator
is used as in the serial implementation.

Afterwards, GPU memory is allocated for the training data, the initial chromosome weights and
the other necessary parameters. After allocation, all of these values are copied to the memory
locations allocated by cudaMalloc.







__global__ void calculateFirstFitness(float *individuals, ANNDATA

*trainData,int trainDatasize,float* fitnessVector)









The first parameter is the chromosome array holding the weight values, trainData is the
structure holding the training data, trainDatasize is the number of training samples, and
fitnessVector is the array in which the fitness values will be stored. This kernel is called
from the host as:







calculateFirstFitness<<<[...]>>>
(d_individuals,d_training_data,trainDataSize,d_fitnessArry);









A thread is created for each chromosome, so that each thread picks up one chromosome and
calculates its fitness, starting with the following code snippet.







int blockId = blockIdx.x;

int threadId = threadIdx.x;

int index = blockId * blockDim.x + threadId;

Chrosomome individual = &individuals[CONNECTION_NUM * index];









The thread iterates over all training samples and calculates the error rate in the same way as
the serial implementation. To feed the input data forward through the ANN and to calculate the
error rate, device functions are used, which will be described in the next section. After the
error rate of an individual has been calculated, the fitness value (error rate) is written to
the corresponding index of the fitness array. The aim is therefore to decrease the fitness value
(that is, to decrease the error rate).

The most important part of the application is the kernel responsible for training the ANN. It is
the heart of the application, where the whole training process takes place.

The main steps are almost the same as in the serial implementation, but are implemented in a
parallel manner. First, let us see the signature of the kernel and how it is called from the
host side.







__global__ void evolvePopulation(float* individuals,ANNDATA

*training_data, int trainDataSize, Rand48 *random, float

*fitnessVector, Chrosomome res)









random is the random number generator used on the CUDA side and will be described in detail
later. The best chromosome is returned in the “res” array.

As a design decision, the training data is cached in shared memory for fast access. The limited
size of shared memory is a restriction; however, we assume that the training data will not be
bigger than 16 KB. An advantage of training based on genetic algorithms is that it does not need
a large training set, so the assumption that the training data fits into shared memory is not a
problem. If genetic algorithms are used for training, it is better to take advantage of this by
using few training samples.







extern __shared__ ANNDATA data[];

//cache training data to shared memory
if(threadId == 0){
    for(int i = 0 ; i < trainDataSize ; i++)
        data[i] = training_data[i];
}
__syncthreads();



The data array is declared __shared__ since it is stored in shared memory, and extern because
its size is not known at compile time. After this declaration, the 0th thread of each block is
responsible for copying the training data from global memory to shared memory. After the copy
operation, barrier synchronization is used since we want to be sure that the training data is in
shared memory before the training process starts.

At this point, I want to point out a performance-related term called “bank conflict”. Shared
memory on the GPU is divided into sections called “banks”, so that different threads can access
different banks simultaneously. However, if more than one thread wants to access the same bank,
CUDA serializes the requests and lets the threads access that memory one after another. This
decreases the parallel performance of the program: the threads cannot access the bank in
parallel, so they access it serially. (NVIDIA CUDA Programming Guide)

If you look at the training data fetching code, it can easily be seen that the 0th threads of
all blocks try to access the same training data elements at the same time. This creates a bank
conflict as described above. Although there are methods for resolving bank conflicts, they were
not handled within the scope of this project. The same bank conflict problem arises while
calculating the error rate, which will be discussed later.
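How addresses map to banks can be illustrated with a small host-side model. The sketch below assumes the shared memory layout usually documented for compute capability 1.x devices (16 banks, consecutive 32-bit words in consecutive banks, requests issued per half-warp of 16 threads); these figures, the function names and the simplification that reads of one identical word count only once are assumptions of this example, not measurements from this project.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr int kBanks = 16;      // assumed number of shared memory banks
constexpr int kWordBytes = 4;   // assumed bank width: one 32-bit word

// Bank that serves the word at the given byte offset in shared memory.
int bankOf(std::size_t byteOffset) {
    return static_cast<int>((byteOffset / kWordBytes) % kBanks);
}

// Largest number of different words that one half-warp requests from the
// same bank: 1 means conflict free, n means an n-way conflict.
int conflictDegree(const std::vector<std::size_t>& byteOffsets) {
    assert(byteOffsets.size() <= 16);  // one entry per thread of a half-warp
    std::array<std::set<std::size_t>, kBanks> wordsPerBank;
    for (std::size_t offset : byteOffsets)
        wordsPerBank[bankOf(offset)].insert(offset / kWordBytes);
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

In this model, 16 threads reading consecutive floats are conflict free, a stride of two floats gives a 2-way conflict, and a stride of 16 floats sends every request to bank 0, which is the worst case.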



Storing the training data in shared memory gives low-latency access to it. This is of the utmost
importance because, while calculating the fitness value, every thread has to pass over all the
training samples one by one. Assume that the block size is 256 and that there are 128 training
samples. Reading the samples directly from global memory then means 256 * 128 = 32,768 global
memory accesses per block. If each global memory access takes about 600 cycles, this amounts to
32,768 * 600 = 19,660,800 cycles to process all training samples for a block. In our approach,
the 0th thread accesses global memory 128 times, which costs 128 * 600 = 76,800 cycles, and each
of the 32,768 subsequent reads goes to shared memory at about 3 to 4 cycles, that is roughly
32,768 * 3 = 98,304 cycles. In total this is 76,800 + 98,304 = 175,104 cycles, so under this
simple serial cost model the shared memory solution is roughly 112 times faster than the
solution that uses global memory only.

The fitness array is also cached in shared memory for fast access. Instead of the whole fitness
array, only the relevant part of it is stored in shared memory, as shown below (let us assume
that the block size is 5 for ease of presentation).



Fitness Array (block size = 5)

Elements 0 to 4: Block 0
Elements 5 to 9: Block 1
Elements 10 to 14: Block 2
Elements 15, 16, 17, 18, ..









__shared__ float s_fitVec[BLOCK_SIZE];



s_fitVec[threadId] = fitnessVector[index];

__syncthreads();









Each thread in the block is responsible for fetching its own fitness value and copying it to
shared memory. As you can see, the global fitness array is indexed with the unique thread id,
whereas the shared fitness vector is indexed with the local thread id within the block, as only
the relevant part of the global fitness array is kept.

After the kernel-related data has been cached in shared memory, the evolution process (mutation,
crossover and selection) can start.

First, three random numbers are generated to select different individuals in the population for
the mutation operation. These random numbers are used as indices to select the individuals.
After the individuals have been selected, the mutation function is called, which will be
explained in the next section. The mutation operator creates a mutated vector.

The next operation is crossover, which works as described for the serial implementation.
However, instead of generating random numbers during the crossover operation, they are generated
beforehand and passed to the crossover function as a parameter. After this operation, we have an
evolved individual. The next step is selection, which means checking whether the evolved
individual reduces the fitness value or not. It is done with the following code snippet:







float result[OUTPUT_NUM];

float err = 0.0f;

for(int i = 0 ; i < trainDataSize ; i++){

feedForward(crossVec,data[i].input,result);

err += getANNerror(result,data[i].output);

}



if(err < s_fitVec[threadId]){

s_fitVec[threadId] = err;

for(int indx = 0 ; indx < CONNECTION_NUM ; indx++){

individuals[index * CONNECTION_NUM + indx] = crossVec[indx];

}

}









If the new vector reduces the error rate, the error is written to the fitness vector stored in
shared memory and the old individual is replaced by the new one. Finally, each thread in the
block copies its fitness value from shared memory to global memory. This completes one evolution
cycle.

For random number generation on the CUDA side, a linear congruential generator has been used.
The source code was obtained from http://www.amolf.nl/~vanmeel/mdgpu/download.html. The
generator is created on the host side, initialized with a seed and then passed to the kernel as
a parameter. In kernel code, the state is loaded first, then the numbers are generated, and then
the state is stored again. Sample code for random number generation is shown below:



rand48_loadState(rng);

rand48_nextFloat(rng);

rand48_storeState(rng);
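The idea behind a 48-bit linear congruential generator and the load/generate/store pattern can be sketched on the host as follows. This is not the Rand48 source used in the project: the class name `Rand48Sketch` and its interface are hypothetical, and the multiplier, increment and seeding scheme are the well-known constants of the POSIX drand48 family, assumed here for illustration.

```cpp
#include <cassert>
#include <cstdint>

// 48-bit linear congruential generator: x = (a * x + c) mod 2^48.
class Rand48Sketch {
public:
    // Seeding in the style of srand48: seed in the high bits, 0x330E below.
    explicit Rand48Sketch(std::uint64_t seed)
        : state_(((seed & 0xFFFFFFFFULL) << 16) | 0x330EULL) {}

    // Advances the generator by one step and returns the new 48-bit state.
    std::uint64_t next() {
        state_ = (kA * state_ + kC) & kMask;
        return state_;
    }

    // Uniform float in [0, 1) built from the top 24 bits of the state.
    float nextFloat() {
        return static_cast<float>(next() >> 24) / 16777216.0f;
    }

    // Counterparts of loading and storing the state around a batch of draws.
    std::uint64_t state() const { return state_; }
    void setState(std::uint64_t s) { assert(s <= kMask); state_ = s; }

private:
    static constexpr std::uint64_t kA = 0x5DEECE66DULL;
    static constexpr std::uint64_t kC = 0xBULL;
    static constexpr std::uint64_t kMask = (1ULL << 48) - 1;
    std::uint64_t state_;
};
```

Saving the state with `state()` and restoring it with `setState()` reproduces exactly the same sequence, which is why a parallel version has to store the state after use so that the next draws continue the sequence instead of repeating it.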

4-Device Functions Design



Device functions called from kernel are showed below,



• __device__ void feedForward(Chrosomome chr, float* data, float *result)

Described in the previous section.



• __device__ float getANNerror(float* actual, float *desired)

Calculates the error between the actual and desired values and returns it.



• __device__ void mutate(float *r1, float *r2, float *r3, float *mutVec)

Creates a mutated vector according to the formula $v = r_1 + \mu\,(r_2 - r_3)$ and
returns the mutated vector $v$ through the mutVec pointer.



• __device__ void crossover(float *original, float *trial, float *randomNumbers, float *crossed)

Performs crossover between the original and trial vectors using randomNumbers and
returns the resulting vector through the crossed pointer.



• __device__ void findMinFitness(float* fitnessVec, float *result)

Scans the fitness array, finds the minimum fitness value and returns it in result.
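The device functions listed above, apart from feedForward, can be sketched as follows. This is an illustration under stated assumptions rather than the project's source: the error measure is assumed to be the sum of squared differences, the crossover is assumed to be the usual binomial scheme of differential evolution, and the values of OUTPUT_NUM, CONNECTION_NUM and POP_SIZE are placeholders (POP_SIZE is a hypothetical name). The macro at the top only allows the same text to compile as plain C on the host.

```c
#ifndef __CUDACC__
#define __device__             /* lets the sketch compile as plain C on the host */
#endif

#define OUTPUT_NUM      1      /* placeholder values; the real constants    */
#define CONNECTION_NUM  4      /* follow from the network topology          */
#define POP_SIZE        8      /* hypothetical population size              */
#define MUTATION_RATE   0.2f   /* default mutation rate given in Results    */
#define CROSSOVER_RATE  0.6f   /* default crossover rate given in Results   */

/* Sum of squared differences between actual and desired outputs
   (the exact error measure is an assumption). */
__device__ float getANNerror(float *actual, float *desired)
{
    float err = 0.0f;
    for (int i = 0; i < OUTPUT_NUM; i++) {
        float d = actual[i] - desired[i];
        err += d * d;
    }
    return err;
}

/* mutVec = r1 + mu * (r2 - r3), applied to every weight. */
__device__ void mutate(float *r1, float *r2, float *r3, float *mutVec)
{
    for (int i = 0; i < CONNECTION_NUM; i++)
        mutVec[i] = r1[i] + MUTATION_RATE * (r2[i] - r3[i]);
}

/* Binomial crossover (assumed): take the trial gene when its pre-generated
   random number is below the crossover rate, otherwise keep the original. */
__device__ void crossover(float *original, float *trial,
                          float *randomNumbers, float *crossed)
{
    for (int i = 0; i < CONNECTION_NUM; i++)
        crossed[i] = (randomNumbers[i] < CROSSOVER_RATE) ? trial[i] : original[i];
}

/* Linear scan for the smallest fitness value. */
__device__ void findMinFitness(float *fitnessVec, float *result)
{
    float best = fitnessVec[0];
    for (int i = 1; i < POP_SIZE; i++)
        if (fitnessVec[i] < best)
            best = fitnessVec[i];
    *result = best;
}
```

Passing randomNumbers into crossover, instead of drawing them inside the function, keeps the generator state handling in one place in the kernel.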







4. Results



To test our CUDA application, the sine function has been used as a benchmark, since it is a
common benchmark for ANNs and its training data is easy to generate. (Find sin
benchmark sources)



For testing purposes, 5 different training sets have been created over the range 0 to 90
degrees, sampled at intervals of 0.25, 0.5, 1, 1.5 and 2.5 degrees.



To approximate the sine function, a 3-layer ANN has been used: 1 input layer, 1 hidden layer
and 1 output layer. The hidden layer consists of 6 neurons.



The default mutation rate is 0.2 and the default crossover rate is 0.6. In addition, performance
is measured for different population sizes. When comparing the serial and parallel
implementations, the same parameters have been used.

5. Conclusion



6. Future Work









References



Bibliography



GA - Genetic Algorithms



DE - Differential Evolution



ANN - Artificial Neural Network



GPU - Graphics Processing Unit



CUDA - Compute Unified Device Architecture



CPU - Central Processing Unit



SIMD - Single Instruction, Multiple Data



HLSL - High Level Shading Language

Cg - C for Graphics


