Genetic Algorithms
An Evolutionary Approach to Problem
Solving
Illusions of Design
Living things in nature look as if they
were designed by a skilled engineer/designer.
Evolutionary theory: living things are too exquisitely
designed to be the product of a purely “random” process.
If not random, then what is the non-random
process?
Evolution
A process of cumulative selection
Individuals with new traits are developed through
Mutations
Sexual Reproduction
Individuals with better traits are
more likely to survive and
more likely to transfer those traits to their
descendants.
Key Principles of Evolution
http://www.pbs.org/wgbh/evolution/library/11/2/e_s_4.html
Variety: a population of individuals with different sets of
traits
Through reproduction: the importance of sexual reproduction
versus asexual reproduction
(http://www.pbs.org/wgbh/evolution/library/01/5/l_015_03.html)
Random mutations
Evaluation & Selection (through constraints in the
environment)
Reproduction: Transfer of traits to offspring
From Evolution of life in Nature
to Evolution of Solutions
Can evolutionary principles be used to
develop solutions to problems?
Can these principles be used to do Data
Mining?
Computational Analogy of
Evolution
What is the equivalent of
an individual
(chromosome)?
What is the equivalent of
an individual having “good”
genetic material/good set
of traits?
What is the equivalent of a
population of individuals?
What is the equivalent of
reproduction?
What is the equivalent of
survival of the fittest?
Computational Analogy of Evolution
What is an individual
A collection of traits
Example: An individual has the following
traits
Color: Black or White
Speed: Fast, Medium, Slow or Very Slow
Intelligence: Very Smart, Smart, Somewhat Smart,
Medium Smart, Dumb or Very Dumb
GA – Generic Strategy
Start with a “population” of solutions (individuals)
Repeat
Choose 2 solutions from the population
With a certain probability, apply a crossover operator to create
child1, child2
With a certain probability, mutate child1 and child2
Update the population (discard bad solutions).
Until a stopping criterion has been met
Output the best solution(s) in the population
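The generic strategy above can be sketched in a few lines. This is only an illustration under stated assumptions: the function and parameter names (`run_ga`, `p_cross`, `p_mut`), the fixed generation budget used as the stopping criterion, and the toy bit-counting problem are all hypothetical choices, not part of the slides.

```python
import random

def run_ga(fitness, random_individual, crossover, mutate,
           pop_size=20, generations=100, p_cross=0.8, p_mut=0.1, rng=None):
    """Generic GA loop; the four callables are problem-specific."""
    rng = rng or random.Random()
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):  # stopping criterion: fixed budget (assumption)
        parent1, parent2 = rng.sample(population, 2)
        child1, child2 = parent1, parent2
        if rng.random() < p_cross:
            child1, child2 = crossover(parent1, parent2, rng)
        if rng.random() < p_mut:
            child1, child2 = mutate(child1, rng), mutate(child2, rng)
        # Update the population: add the children, discard the worst solutions.
        population = sorted(population + [child1, child2],
                            key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)

# Toy problem (hypothetical): maximize the number of 1 bits in a 12-bit string.
def random_bits(rng, n=12):
    return tuple(rng.randint(0, 1) for _ in range(n))

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

best = run_ga(sum, random_bits, one_point, flip_one, rng=random.Random(0))
print(best, sum(best))
```

Discarding the worst solutions is one simple way to update the population; it guarantees the best fitness never decreases, at the cost of less diversity.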
What is an individual
Typically a coding system is used to code
each trait. E.g., binary coding.
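As an illustration of binary coding, the sketch below encodes the example individual's three traits as a fixed-width bit string. The bit widths, the ordering of values, and the names `TRAITS`, `encode`, and `decode` are assumptions made for this example.

```python
TRAITS = {
    "color": ["Black", "White"],                                # 1 bit
    "speed": ["Fast", "Medium", "Slow", "Very Slow"],           # 2 bits
    "intelligence": ["Very Smart", "Smart", "Somewhat Smart",
                     "Medium Smart", "Dumb", "Very Dumb"],      # 3 bits
}

def bits_needed(n_values):
    """Smallest number of bits that can index n_values distinct values."""
    return max(1, (n_values - 1).bit_length())

def encode(individual):
    """Concatenate a fixed-width binary code for each trait value."""
    out = ""
    for trait, values in TRAITS.items():
        width = bits_needed(len(values))
        out += format(values.index(individual[trait]), f"0{width}b")
    return out

def decode(bits):
    """Invert encode(); raises IndexError for a code with no assigned value."""
    individual, pos = {}, 0
    for trait, values in TRAITS.items():
        width = bits_needed(len(values))
        individual[trait] = values[int(bits[pos:pos + width], 2)]
        pos += width
    return individual

ind = {"color": "White", "speed": "Slow", "intelligence": "Smart"}
code = encode(ind)
print(code)          # 110001
print(decode(code))
```

Intelligence has six values but three bits give eight codes, so two codes are unused; crossover and mutation on raw bits would need to repair or reject them.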
Representing Solutions
The Fitness Function: Evaluation of
Individual Solutions
A “fitness function” assigns each individual a score that measures how good the solution is.
Process of Evaluating the
Fitness
What a Population of Solutions
Looks Like
Reproduction: Crossover
Crossover between parents’ traits creating two children.
There are many crossover operators
1. Randomly choose a position (the crossover point) and cross over contents
before that position.
Parents:
White Medium Dumb
Black Slow Smart
Children:
White Slow Smart
Black Medium Dumb
Reproduction: Crossover
2. Randomly choose a set of traits (genes) and
cross over those.
Example: Cross over color and intelligence
Parents:
White Medium Dumb
Black Slow Smart
Children:
Black Medium Smart
White Slow Dumb
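Both crossover operators can be sketched as follows. The function names are hypothetical, and the crossover point and trait positions are passed in explicitly so the slide examples can be reproduced; in a GA they would be chosen at random.

```python
import random

def one_point_crossover(p1, p2, point):
    """Swap the contents before `point` between the two parents."""
    return p2[:point] + p1[point:], p1[:point] + p2[point:]

def trait_set_crossover(p1, p2, positions):
    """Swap the values at the chosen trait positions."""
    c1, c2 = list(p1), list(p2)
    for i in positions:
        c1[i], c2[i] = c2[i], c1[i]
    return tuple(c1), tuple(c2)

parent1 = ("White", "Medium", "Dumb")
parent2 = ("Black", "Slow", "Smart")

# Crossover point after the first trait (color).
print(one_point_crossover(parent1, parent2, point=1))
# Cross over color (position 0) and intelligence (position 2).
print(trait_set_crossover(parent1, parent2, positions=[0, 2]))

# In practice the point is drawn at random, e.g.:
random_point = random.randrange(1, len(parent1))
```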
Reproduction: Mutation
The purpose of mutations is to take a single solution and
introduce some random “shock” or changes to it to
create a new solution.
Implementation: Randomly choose a trait for each child
and randomly change its value to another valid value
Before: Black Medium Smart, White Slow Dumb
After: Black Fast Smart, White Slow Smart
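A minimal sketch of this mutation, assuming an individual is a tuple of (color, speed, intelligence) and the valid values are those listed for the example individual. The names `VALID` and `mutate` are illustrative.

```python
import random

VALID = [
    ["Black", "White"],
    ["Fast", "Medium", "Slow", "Very Slow"],
    ["Very Smart", "Smart", "Somewhat Smart", "Medium Smart", "Dumb", "Very Dumb"],
]

def mutate(individual, rng=random):
    """Pick one trait at random and change it to a different valid value."""
    i = rng.randrange(len(individual))
    choices = [v for v in VALID[i] if v != individual[i]]
    mutated = list(individual)
    mutated[i] = rng.choice(choices)
    return tuple(mutated)

child = ("Black", "Medium", "Smart")
print(mutate(child, random.Random(1)))
```

Excluding the current value from `choices` ensures that a mutation always produces a new solution.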
What is the equivalent of
survival of the fittest?
Simply give solutions with better fitness a higher
probability of being chosen for reproduction.
Sample with replacement: solutions with bigger
fitness will be selected more times.
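Fitness-proportional sampling with replacement can be sketched with the standard library's `random.choices`. The population and fitness values below are made up for illustration, and the sketch assumes all fitness values are non-negative with at least one positive.

```python
import random
from collections import Counter

def select_parents(population, fitnesses, k=2, rng=random):
    """Sample k individuals with replacement, weighted by fitness."""
    return rng.choices(population, weights=fitnesses, k=k)

population = ["A", "B", "C"]
fitnesses = [1.0, 3.0, 6.0]   # hypothetical fitness values

rng = random.Random(42)
counts = Counter(select_parents(population, fitnesses, k=6000, rng=rng))
print(counts)   # C is drawn most often, A least often
```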
Population
What will be the composition of
population in future
generations?
Example: The Traveling Sales
Person
First look at the Traveling Salesman Problem
Then see how the same principles can be
applied to:
Extracting rules from data to understand which
customers are more likely to be responsive
Extracting technical trading rules
Optimizing service schedules
The Traveling Salesman
Problem
A traveling salesman has to visit a set of n cities.
Find the order in which the salesman visits the cities so as to
minimize total distance.
Variants of this problem are found in several domains.
As n gets large, exhaustive search becomes impractical
because the number of possible orderings grows factorially with n.
Heuristic methods are needed to find good solutions, even if these
are not guaranteed to be the “best”.
The Traveling Salesman Problem
Consider the TSP problem
5 cities to visit: London, Oxford, Cambridge,
Brighton, and Bath.
What is the best path?
[Figure: map showing London, Cambridge, Oxford, Brighton, and Bath]
The Traveling Salesman Problem
Genetic Algorithms Solution
Step-1:
An individual is a “candidate solution”, a path.
Examples of candidate solutions for the TSP:
London, Oxford, Cambridge, Brighton, Bath
London, Bath, Oxford, Brighton, Cambridge
Brighton, London, Cambridge, Bath, Oxford.
…
The Traveling Salesman Problem
Coding Scheme for Candidate Solutions
Order of London | Order of Oxford | Order of Cambridge | Order of Brighton | Order of Bath
Example:
Path: London, Oxford, Cambridge, Brighton, Bath    Code: 1 2 3 4 5
Path: Oxford, London, Cambridge, Bath, Brighton    Code: 2 1 3 5 4
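The coding scheme stores, for each city in a fixed list, its position in the visiting sequence. A small sketch of the conversion in both directions (the function names are hypothetical):

```python
CITIES = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]

def path_to_code(path):
    """For each city in CITIES, record its 1-based order in the path."""
    return [path.index(city) + 1 for city in CITIES]

def code_to_path(code):
    """Rebuild the visiting sequence from the per-city orders."""
    path = [None] * len(code)
    for city, order in zip(CITIES, code):
        path[order - 1] = city
    return path

print(path_to_code(["Oxford", "London", "Cambridge", "Bath", "Brighton"]))
# [2, 1, 3, 5, 4]
print(code_to_path([1, 5, 3, 2, 4]))
# ['London', 'Brighton', 'Cambridge', 'Bath', 'Oxford']
```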
The Traveling Salesman Problem
Step 2: Fitness Function
How “good” are the following solutions?
London, Oxford, Cambridge, Brighton, Bath
London, Bath, Oxford, Brighton, Cambridge
Brighton, London, Cambridge, Bath, Oxford.
The Traveling Salesman Problem
Fitness in TSP Problem: Distance Table
Distance (in miles)
            London  Oxford  Cambridge  Brighton  Bath
London           0     350         50       280   470
Oxford                   0        130       270   310
Cambridge                           0       210   340
Brighton                                      0   220
Bath                                                0
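A possible fitness function built on this table is sketched below. It assumes the salesman returns to the starting city and that fitness is the negative of the total distance, so shorter tours score higher; this assumption is consistent with the value of -990 listed later for Bath, Oxford, Cambridge, London, Brighton. The names `DIST`, `distance`, and `fitness` are illustrative.

```python
# Upper triangle of the distance table, in miles.
DIST = {
    ("London", "Oxford"): 350, ("London", "Cambridge"): 50,
    ("London", "Brighton"): 280, ("London", "Bath"): 470,
    ("Oxford", "Cambridge"): 130, ("Oxford", "Brighton"): 270,
    ("Oxford", "Bath"): 310, ("Cambridge", "Brighton"): 210,
    ("Cambridge", "Bath"): 340, ("Brighton", "Bath"): 220,
}

def distance(a, b):
    """Symmetric lookup: the table stores each pair only once."""
    if a == b:
        return 0
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def fitness(path):
    """Negative length of the closed tour (assumption: return to start)."""
    legs = zip(path, path[1:] + path[:1])
    return -sum(distance(a, b) for a, b in legs)

print(fitness(["Bath", "Oxford", "Cambridge", "London", "Brighton"]))  # -990
```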
The Traveling Salesman Problem
Creating New Solutions
Crossover Operation : Randomly choose a position and cross over
contents before that position.
Crossover Operation for TSP: part of the first parent is copied and the
rest is taken in the same order as in the second parent
Order of: London Oxford Cambridge Brighton Bath
Parents (crossover point after the second position):
London Oxford Cambridge Brighton Bath    1 2 3 4 5
London Brighton Cambridge Bath Oxford    1 5 3 2 4
Child 1:
London Oxford Brighton Cambridge Bath    1 2 4 3 5
Child 2:
London Cambridge Brighton Bath Oxford    1 5 2 3 4
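The TSP crossover described above (copy part of the first parent, then take the remaining cities in the order they appear in the second parent) can be sketched on the path representation. The function name and the explicit `point` argument are assumptions for illustration.

```python
def tsp_crossover(parent1, parent2, point):
    """Copy parent1 up to `point`, then add the missing cities
    in the order in which they appear in parent2."""
    head = parent1[:point]
    tail = [city for city in parent2 if city not in head]
    return head + tail

parent1 = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]
parent2 = ["London", "Brighton", "Cambridge", "Bath", "Oxford"]

print(tsp_crossover(parent1, parent2, point=2))
# ['London', 'Oxford', 'Brighton', 'Cambridge', 'Bath']
```

With the crossover point after the second city this reproduces the first child shown. Swapping the roles of the parents gives London, Brighton, Oxford, Cambridge, Bath, which is a valid path but differs from the second child in the slides; that child appears to result from applying the same idea to the order codes rather than to the paths.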
Reproduction The Traveling Salesman Problem
Exchange the cities in second and fourth place
London Oxford Cambridge Brighton Bath
London Brighton Cambridge Bath Oxford
London Oxford Cambridge Brighton Bath
London Oxford Brighton Cambridge Bath
Reproduction
Mutation in the TSP Problem
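Randomly changing one gene won’t work: it can produce an invalid path.
London Oxford Cambridge Brighton Bath
1 2 3 4 5
1 2 4 4 5 (invalid: two cities share position 4 and no city is in position 3)
Design mutations around the “swap” concept:
1 2 5 4 3 (the positions of Cambridge and Bath are swapped)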
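A swap mutation always yields a valid permutation, because it only exchanges two existing values. A minimal sketch (the function name is hypothetical):

```python
import random

def swap_mutation(code, rng=random):
    """Exchange the values at two distinct, randomly chosen positions."""
    i, j = rng.sample(range(len(code)), 2)
    mutated = list(code)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

print(swap_mutation([1, 2, 3, 4, 5], random.Random(0)))
```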
GA for TSP
For the TSP problem we have:
Solution representation
A fitness evaluation function
Crossover operations on parents
Mutation on a single solution
Start with a population of solutions, and let
them evolve
Next Step: Initialize a population
Decision parameter: population size
Let us choose 5 solutions in our population.
We will now randomly initialize the population.
London, Oxford, Cambridge, Brighton, Bath (-1380)
Oxford, London, Cambridge, Bath, Brighton (-1230)
Cambridge, Oxford, Brighton, Bath, London (-1140)
Bath, London, Brighton, Cambridge, Oxford (-1400)
Bath, Oxford, Cambridge, London, Brighton (-990)
Evolving Solutions for TSP
Repeat (until a stopping criterion has been met)
Choose 2 solutions from the population
With a certain “crossover probability” (say, 0.8) apply a
crossover operator to create child1, child2
With a certain “mutation” probability (say 0.1), mutate
child1, child2
Place the resulting 2 children in the population
Selection: which 5 solutions survive? The probability that
each individual survives is proportional to its
fitness
After stopping, output the best chromosome in the
population for the solution
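The whole loop for the five-city example might look like the sketch below. Several choices are assumptions rather than part of the slides: the tour is closed (returns to the start), survival weights are 1 / tour length so that they are positive, survivors are sampled with replacement, and the stopping criterion is a fixed number of iterations.

```python
import random

CITIES = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]
MILES = {  # upper triangle of the distance table
    ("London", "Oxford"): 350, ("London", "Cambridge"): 50,
    ("London", "Brighton"): 280, ("London", "Bath"): 470,
    ("Oxford", "Cambridge"): 130, ("Oxford", "Brighton"): 270,
    ("Oxford", "Bath"): 310, ("Cambridge", "Brighton"): 210,
    ("Cambridge", "Bath"): 340, ("Brighton", "Bath"): 220,
}

def miles(a, b):
    return MILES[(a, b)] if (a, b) in MILES else MILES[(b, a)]

def tour_length(path):
    return sum(miles(a, b) for a, b in zip(path, path[1:] + path[:1]))

def crossover(p1, p2, rng):
    point = rng.randrange(1, len(p1))
    def make(a, b):
        head = a[:point]
        return head + [city for city in b if city not in head]
    return make(p1, p2), make(p2, p1)

def mutate(path, rng):
    i, j = rng.sample(range(len(path)), 2)
    path = list(path)
    path[i], path[j] = path[j], path[i]
    return path

def evolve(pop_size=5, iterations=200, p_cross=0.8, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(CITIES, len(CITIES)) for _ in range(pop_size)]
    for _ in range(iterations):
        p1, p2 = rng.sample(population, 2)
        c1, c2 = list(p1), list(p2)
        if rng.random() < p_cross:
            c1, c2 = crossover(p1, p2, rng)
        if rng.random() < p_mut:
            c1, c2 = mutate(c1, rng), mutate(c2, rng)
        candidates = population + [c1, c2]
        # Shorter tours get larger survival weights (assumed transform).
        weights = [1.0 / tour_length(t) for t in candidates]
        population = rng.choices(candidates, weights=weights, k=pop_size)
    return min(population, key=tour_length)

best = evolve()
print(best, tour_length(best))
```

Because tour lengths here differ by less than a factor of two, the 1 / length weights give only mild selection pressure, and sampling survivors at random can lose the best tour found so far. Keeping the best individual unconditionally (elitism) is a common remedy.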
Overview of the Selection Process
Over time, through the various operators,
solutions mate and traits are passed on to the
offspring.
Children with “better” traits have a better chance of
surviving.
The weak solutions gradually disappear
from the population.
Good solutions come to predominate in the population
Building Solutions through
Evolution
http://www.pbs.org/saf/1103/video/watchonline.htm
GA – Advantages:
Not engineering: enables finding
surprising solutions to problems
Can often find good solutions to problems that
are hard to tackle by traditional means.
Implicit parallelism can make GAs an
efficient optimization method.
A key strength is the ability to find
approximate solutions to combinatorially
explosive problems.
GA - Disadvantages
A heuristic: GAs may find only near-
optimal solutions.
Further limitations are the difficulty of
choosing a suitable representation
and of making the right choice of
selection method and genetic operator
probabilities.
Learning Classification Rules
With GA
A classifier can be represented as a set of rules of the form
IF (set of conditions) THEN (Class)
IF (Employed=No) Then (Class=No)
IF (Employed=Yes AND Balance [...] 50K) Then (Yes)
IF (Employed=Yes AND Balance>=50K AND Age>=45) Then (Yes)
[Figure: decision tree that splits on Employed, then Balance, then Age, with Yes/No leaves]
GA for Learning Classification
Rules From Data
Representation of Rules
All rules represent the class Yes
(Conjunction of Conditions):
Each position is a trait and its value
Marital Status   Has a Job?   Age>40
Married          Yes          No
In addition to all valid values, each attribute can take an empty
condition (*):
Marital Status   Has a Job?   Age>40
Married          *            No
Reproduction
Crossover
Parents:
Marital Status   Has a Job?   Age>40
Married          Yes          No
Divorced         No           Yes
Children:
Marital Status   Has a Job?   Age>40
Married          No           Yes
Divorced         Yes          No
Reproduction
Mutation
Random changes in attribute values
Before:
Marital Status   Has a Job?   Age>40
Married          Yes          No
After:
Marital Status   Has a Job?   Age>40
Divorced         No           No
Fitness
Support
Confidence
Lift
Support*Lift
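The slide lists the measures by name only. The sketch below uses their usual association-rule definitions, which is an assumption about what is intended: support is the fraction of all examples that match the rule and have class Yes, confidence is the fraction of matching examples that have class Yes, and lift is confidence divided by the overall fraction of Yes examples. The data set is invented for illustration, and `*` stands for an empty condition.

```python
def matches(rule, example):
    """A rule applies when every non-empty condition equals the example's value."""
    return all(r == "*" or r == e for r, e in zip(rule, example))

def rule_measures(rule, data, target="Yes"):
    n = len(data)
    covered = [label for example, label in data if matches(rule, example)]
    hits = covered.count(target)
    base_rate = sum(1 for _, label in data if label == target) / n
    support = hits / n
    confidence = hits / len(covered) if covered else 0.0
    lift = confidence / base_rate if base_rate else 0.0
    return support, confidence, lift

# Hypothetical examples: (Marital Status, Has a Job?, Age>40), class
data = [
    (("Married", "Yes", "No"), "Yes"),
    (("Married", "No", "No"), "No"),
    (("Single", "Yes", "Yes"), "Yes"),
    (("Divorced", "No", "Yes"), "No"),
    (("Married", "Yes", "Yes"), "Yes"),
    (("Single", "No", "No"), "No"),
    (("Divorced", "Yes", "No"), "No"),
    (("Married", "Yes", "No"), "Yes"),
]

support, confidence, lift = rule_measures(("Married", "*", "No"), data)
print(support, confidence, lift, support * lift)
```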
Selection
Problem with regular selection mechanism
Want to develop a variety of rules which cover minority groups as
well
Solution: “Segmented” Elections
Each example votes for one of the rules which apply to it. For
example:
Assume our population of rules includes:
(Marital status: Married, Has_a_job: Yes, Age>40?: Yes)
Single, Empty, Yes
Empty, Yes, Empty
Divorced, Empty, No
The example: (Single, Yes, Yes) can vote for either:
Single, Empty, Yes
Empty, Yes, Empty
The probability of voting for either rule is proportional to the fitness of
the rule
Selection: The survival of the fittest
Each rule gets a score that is the proportion of the examples
to which it applies that voted for this rule
Example:
Single, Empty, Yes – 30%
Empty, Yes, Empty – 70%
Each rule competes only with rules that apply to the
same set of examples.
This allows a form of niching. Rules applying to small
subsets of examples can survive.
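The voting scheme can be sketched as follows. The rule fitness values and the example set are hypothetical, and the function names are illustrative; the score follows the description above (votes received divided by the number of examples to which the rule applies).

```python
import random
from collections import Counter

def applies(rule, example):
    return all(r == "Empty" or r == e for r, e in zip(rule, example))

def election(rules, fitness, examples, rng=random):
    """Each example votes for one applicable rule, chosen with
    probability proportional to the rule's fitness."""
    votes, covered = Counter(), Counter()
    for example in examples:
        eligible = [r for r in rules if applies(r, example)]
        for r in eligible:
            covered[r] += 1
        if eligible:
            weights = [fitness[r] for r in eligible]
            votes[rng.choices(eligible, weights=weights, k=1)[0]] += 1
    return {r: (votes[r] / covered[r] if covered[r] else 0.0) for r in rules}

rules = [("Single", "Empty", "Yes"),
         ("Empty", "Yes", "Empty"),
         ("Divorced", "Empty", "No")]
fitness = {rules[0]: 1.0, rules[1]: 2.0, rules[2]: 0.5}   # hypothetical values

examples = [("Single", "Yes", "Yes")] * 20 + [("Divorced", "No", "No")] * 3
scores = election(rules, fitness, examples, random.Random(0))
print(scores)
```

The third rule has low fitness and covers only three examples, but it faces no competitor on them, so it receives all of their votes. This is the niching effect described above.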
Another rule representation : Tree
And
  And
    > (Age, 45)
    > (Balance, 50)
  = (Employed, Yes)
Crossover Operations
Parent 1:
And
  And
    > (Age, 45)
    > (Balance, 50)
  = (Employed, Yes)
Parent 2:
And
  > (Age, 45)
  = (Employed, No)
Mutation: Randomly changing subtree
[Figure: the rule tree over Age 45, Balance 50, and Employed = Yes after one of its comparison subtrees is randomly changed]
Learn Trading Rules from data
Individual:
A “buy” / “sell” rule.
Specifying conditions for sell or buy
Representation: tree
Example for a “buy” rule
> (High at T-2, High at T-2)
> (Low at T-4, Low at T-3)
Fitness Function
Return over a period of time (Jonsson et al.)
Adjusted to a benchmark strategy
Crossover
[Figure: two “and” rule trees, each with two “>” comparisons over High and Low prices at lags T-1 to T-4, exchanging subtrees]
Maintaining diversity
Problem with regular selection mechanism
Want to develop a variety of rules which cover different scenarios,
particularly atypical ones
Solution: Niching
Fitness function is adjusted to accommodate diversity
Example: Returns divided by periods in which conditions of the rule
apply
Performance of GA for Trading
Rules
Some studies reported failure (Allen F.,
Karjalainen R. (1993)), while others reported
success over benchmarks such as buy &
hold strategies.
Main argument: Efficient Market
Hypothesis and Non-Predictability
Technical rules have been shown to
outperform
GA as a Methodology
GA is a methodology, not a solution
Provides tools for engineering a system to
generate solutions
Need to formulate the right questions
Good engineering
Risk of overfitting
Design a good fitness function
Incorporate important factors for rules to capture
Topics for Final Exam
Final Exam
The Data Mining Tasks (regression, classification, etc.)
Predictive Vs. Prescriptive
Data-driven vs. Model/Theory-driven
DBMS, Data Warehouse & OLAP
Modeling: Problem formulation, Model Building, Evaluation
Classification Decision Trees
Association rules (representation, understand measures)
K-Nearest Neighbor Classification
Personalization with Collaborative Filtering
Clustering (K-means clustering)
Text Mining
Representation of documents in tabular format for data mining tasks (association
rules, classification)
Information Retrieval
Measures (precision, recall)
Possible applications
Genetic Algorithms:
The principles of GA
How to design solutions to problems with GA