					      Genetic Algorithms

An Evolutionary Approach to Problem
                Solving
         Illusions of Design
   Living things in nature look as if they
    were designed by a skilled engineer or designer.


   Evolutionary theory: living things are too exquisitely
    designed to be the product of “random” chance.
    If not random, then what is the non-random
      process?
                  Evolution
A process of cumulative selection

Individuals with new traits are developed through
  Mutations
  Sexual Reproduction

Individuals with better traits are
  more likely to survive and are
  more likely to transfer those traits to their
  descendants.
    Key Principles of Evolution
        http://www.pbs.org/wgbh/evolution/library/11/2/e_s_4.html
   Variety: a population of individuals with different sets of
    traits
       Through reproduction: the importance of sexual reproduction
        versus asexual reproduction
        (http://www.pbs.org/wgbh/evolution/library/01/5/l_015_03.html)
       Random mutations


   Evaluation & Selection (through constraints in the
    environment)

   Reproduction: Transfer of traits to offspring
From Evolution of life in Nature
   to Evolution of Solutions

Can evolutionary principles be used to
 develop solutions to problems?

Can these principles be used to do Data
 Mining?
     Computation Analogy of
          Evolution
   What is the equivalent of
    an individual
    (chromosome)?
   What is the equivalent of
    an individual having “good”
    genetic material/good set
    of traits?
   What is the equivalent of a
    population of individuals?
   What is the equivalent of
    reproduction?
   What is the equivalent of
    survival of the fittest?
            Computation Analogy of Evolution

         What is an individual
A collection of traits

Example: An individual has the following
  traits
      Color: Black or White
      Speed: Fast, Medium, Slow or Very Slow
      Intelligence: Very Smart, Smart, Somewhat Smart,
       Medium Smart, Dumb or Very Dumb
         GA – Generic Strategy
   Start with a “population” of solutions (individuals)
   Repeat
       Choose 2 solutions from the population
       With a certain probability Apply a crossover to create
        child1, child2
       With a certain probability, mutate child1 and child2
       Update the population (discard bad solutions).
   Until the stopping criterion has been met
   Output the best solution(s) in the population
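
The strategy above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the function names (`evolve`, `one_point_crossover`, `flip_one_bit`), the default probabilities, the fixed number of rounds as the stopping criterion, and the rule of keeping the fittest individuals when updating the population are all illustrative choices, not part of the slides. The toy problem (maximize the number of 1s in a bit string) is likewise hypothetical.

```python
import random

def evolve(population, fitness, crossover, mutate,
           p_crossover=0.8, p_mutation=0.1, rounds=200, rng=None):
    """Generic GA loop. All parameter names and defaults are assumptions."""
    if rng is None:
        rng = random.Random()
    size = len(population)
    population = list(population)
    for _ in range(rounds):                      # stopping criterion: fixed number of rounds
        parent1, parent2 = rng.sample(population, 2)   # choose 2 solutions
        child1, child2 = parent1, parent2
        if rng.random() < p_crossover:
            child1, child2 = crossover(parent1, parent2, rng)
        if rng.random() < p_mutation:
            child1, child2 = mutate(child1, rng), mutate(child2, rng)
        # Update the population: add the children, then discard the worst solutions.
        population = sorted(population + [child1, child2],
                            key=fitness, reverse=True)[:size]
    return max(population, key=fitness)

# Toy problem (hypothetical): maximize the number of 1s in a 12-bit string.
def one_point_crossover(a, b, rng):
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

rng = random.Random(0)
start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(6)]
best = evolve(start, sum, one_point_crossover, flip_one_bit, rng=rng)
print(best, sum(best))
```

Because the update step always keeps the fittest individuals, the best solution in the population can never get worse from one round to the next. Other update rules (for example, probabilistic survival) trade that guarantee for more diversity.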
           What is an individual
A collection of traits

Example: An individual has the following
  traits
        Color: Black or White
        Speed: Fast, Medium, Slow, Very Slow
        Intelligence: Very Smart, Smart, Somewhat Smart,
         Medium Smart, Dumb or Very Dumb
     Typically a coding system is used to code
      each trait. E.g., binary coding.
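
As an illustration of binary coding, the sketch below encodes the three example traits as a fixed-width bit string. The bit widths, the ordering of the values, and the names `TRAITS`, `encode` and `decode` are assumptions made for this example; the slides do not prescribe a particular code.

```python
TRAITS = {
    "Color": ["Black", "White"],
    "Speed": ["Fast", "Medium", "Slow", "Very Slow"],
    "Intelligence": ["Very Smart", "Smart", "Somewhat Smart",
                     "Medium Smart", "Dumb", "Very Dumb"],
}

def bits_needed(n_values):
    """Smallest number of bits that can distinguish n_values options."""
    return max(1, (n_values - 1).bit_length())

def encode(individual):
    """Concatenate the fixed-width binary index of each trait value."""
    return "".join(
        format(values.index(individual[trait]), f"0{bits_needed(len(values))}b")
        for trait, values in TRAITS.items()
    )

def decode(bits):
    """Inverse of encode; rejects bit patterns that map to no trait value."""
    individual, pos = {}, 0
    for trait, values in TRAITS.items():
        width = bits_needed(len(values))
        index = int(bits[pos:pos + width], 2)
        if index >= len(values):
            raise ValueError(f"{bits[pos:pos + width]} is not a valid {trait} code")
        individual[trait] = values[index]
        pos += width
    return individual

example = {"Color": "White", "Speed": "Slow", "Intelligence": "Dumb"}
code = encode(example)
print(code)   # 1 bit for Color, 2 bits for Speed, 3 bits for Intelligence
```

Note that three bits can express eight patterns while Intelligence has only six values, so a bit-level mutation can produce an invalid code. An implementation has to either repair such codes or operate on trait values directly.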
Representing Solutions
The Fitness Function: Evaluation of
        Individual Solutions
   A “fitness function” scores how good each individual solution is.
Process of Evaluating the
         Fitness
What a Population of Solutions
        Looks Like
      Reproduction: Crossover
Crossover between the parents’ traits creates two children.
There are many crossover operators.
   1. Randomly choose a position (the crossover point) and cross over
      the contents before that position.

      Crossover point: after the first trait

      Parent 1:   White | Medium   Dumb
      Parent 2:   Black | Slow     Smart

      Child 1:    White   Slow     Smart
      Child 2:    Black   Medium   Dumb
      Reproduction: Crossover

   2. Randomly choose a set of traits (genes) and
      cross over those.
      Example: cross over Color and Intelligence

      Parent 1:   White   Medium   Dumb
      Parent 2:   Black   Slow     Smart

      Child 1:    Black   Medium   Smart
      Child 2:    White   Slow     Dumb
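
The two crossover operators can be sketched as follows. The function names and the tuple representation of an individual are assumptions for illustration; the crossover point and the chosen positions are passed in explicitly here so that the output can be compared with the slide examples.

```python
import random

def one_point_crossover(parent1, parent2, point):
    """Exchange everything before `point` between the two parents."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def gene_set_crossover(parent1, parent2, positions):
    """Exchange only the traits at the chosen positions."""
    child1, child2 = list(parent1), list(parent2)
    for i in positions:
        child1[i], child2[i] = parent2[i], parent1[i]
    return tuple(child1), tuple(child2)

p1 = ("White", "Medium", "Dumb")
p2 = ("Black", "Slow", "Smart")

one_point_children = one_point_crossover(p1, p2, point=1)
gene_set_children = gene_set_crossover(p1, p2, positions=[0, 2])
print(one_point_children)
print(gene_set_children)

# In a GA the point (or the set of positions) would be drawn at random, e.g.:
rng = random.Random()
random_point = rng.randrange(1, len(p1))
```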
      Reproduction: Mutation

  The purpose of mutation is to take a single solution and
    introduce a random “shock” (a small change) to it to
    create a new solution.

  Implementation: Randomly choose a trait for each child
    and randomly change its value to another valid value.

      Before:   Black   Medium   Smart       White   Slow   Dumb
      After:    Black   Fast     Smart       White   Slow   Smart
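
A minimal sketch of this mutation operator, assuming an individual is a tuple of trait values and that the list of valid values per trait (`TRAIT_VALUES`, a name introduced here) is known:

```python
import random

TRAIT_VALUES = [
    ["Black", "White"],
    ["Fast", "Medium", "Slow", "Very Slow"],
    ["Very Smart", "Smart", "Somewhat Smart", "Medium Smart", "Dumb", "Very Dumb"],
]

def mutate(individual, rng):
    """Pick one trait at random and replace it with a different valid value."""
    i = rng.randrange(len(individual))
    alternatives = [v for v in TRAIT_VALUES[i] if v != individual[i]]
    mutated = list(individual)
    mutated[i] = rng.choice(alternatives)
    return tuple(mutated)

rng = random.Random(42)
child = ("Black", "Medium", "Smart")
mutant = mutate(child, rng)
print(child, "->", mutant)
```

Drawing the new value only from the trait's own list guarantees that the mutated individual is still valid.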
        What is the equivalent of
         survival of the fittest?
Simply give solutions with better fitness a higher
  probability of being chosen for reproduction.




Sample with replacement: solutions with higher
  fitness will, on average, be selected more often.
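
Fitness-proportionate sampling with replacement can be sketched with the standard library's `random.choices`. The population labels and fitness values below are hypothetical, and the sketch assumes all fitness values are non-negative and not all zero.

```python
import random
from collections import Counter

def select_parents(population, fitnesses, k, rng):
    """Sample k individuals with replacement, with probability proportional to fitness.
    Assumes non-negative fitness values that are not all zero."""
    return rng.choices(population, weights=fitnesses, k=k)

rng = random.Random(1)
population = ["A", "B", "C", "D"]
fitnesses = [1, 2, 3, 14]          # hypothetical scores; D holds 70% of the total
counts = Counter(select_parents(population, fitnesses, k=10_000, rng=rng))
print(counts)
```

Over many draws each individual is picked roughly in proportion to its share of the total fitness, so fitter solutions reproduce more often while weaker ones still get an occasional chance.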
               Population
What will be the composition of
 population in future
 generations?
        Example: The Traveling Sales
                  Person
   First look at the Traveling Salesman Problem
   Then see how the same principles can be
    applied to:
     Extracting rules from data to understand which
      customers are more likely to be responsive
     Extracting technical trading rules
     Optimizing service schedules
          The Traveling Salesman
                Problem
   A traveling salesman has to visit a set of n cities.
   Find the order in which the salesman visits the cities so as to
    minimize the total distance traveled.
   Variants of this problem are found in several domains.
   As n gets large, exhaustive search becomes impractical:
    the number of possible tours grows factorially with n.
   We need heuristic methods to find good solutions, even if these
    are not guaranteed to be the “best”.
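
To make the combinatorial growth concrete, the sketch below counts the distinct tours, assuming the tour is a round trip and distances are the same in both directions. Under those assumptions there are (n-1)!/2 different tours; the function name `count_tours` is introduced here for illustration.

```python
import math

def count_tours(n):
    """Number of distinct round trips through n cities, assuming symmetric distances."""
    if n < 3:
        return 1
    return math.factorial(n - 1) // 2

for n in (5, 10, 15, 20):
    print(n, count_tours(n))
```

Five cities give only 12 tours, which can be checked by hand, but the count already exceeds 180,000 for ten cities and keeps growing factorially.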
                     The Traveling Salesman Problem

Consider the TSP problem
5 cities to visit: London, Oxford, Cambridge,
  Brighton, and Bath.
What is the best path?

                London
                         Cambridge


            Oxford               Brighton



                          Bath

      Genetic Algorithms Solution
Step-1:
An individual is a “candidate solution”, a path.

Examples of candidate solutions for the TSP:
   London, Oxford, Cambridge, Brighton, Bath
   London, Bath, Oxford, Brighton, Cambridge

   Brighton, London, Cambridge, Bath, Oxford.

   …

    Coding Scheme for Candidate Solutions
   Each position in the code belongs to a fixed city and holds that
    city's order in the visiting sequence.

   Examples:
       Path                                          London  Oxford  Cambridge  Brighton  Bath
       London, Oxford, Cambridge, Brighton, Bath        1       2        3         4        5
       Oxford, London, Cambridge, Bath, Brighton        2       1        3         5        4
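
The coding scheme can be expressed as a pair of conversion functions. This is a sketch; the names `path_to_code`, `code_to_path` and `is_valid_code` are introduced here for illustration.

```python
CITIES = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]

def path_to_code(path):
    """For each city (in the fixed CITIES order), its 1-based position in the path."""
    return [path.index(city) + 1 for city in CITIES]

def code_to_path(code):
    """Inverse of path_to_code: rebuild the visiting sequence from the positions."""
    path = [None] * len(CITIES)
    for city, position in zip(CITIES, code):
        path[position - 1] = city
    return path

def is_valid_code(code):
    """A code is valid only if it is a permutation of 1..n."""
    return sorted(code) == list(range(1, len(CITIES) + 1))

print(path_to_code(["Oxford", "London", "Cambridge", "Bath", "Brighton"]))
print(code_to_path([1, 5, 3, 2, 4]))
```

The validity check matters later: any operator that changes a code must leave it a permutation, otherwise some city is visited twice and another not at all.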

          Step 2: Fitness Function
How “good” are the following solutions?

   London, Oxford, Cambridge, Brighton, Bath
   London, Bath, Oxford, Brighton, Cambridge

   Brighton, London, Cambridge, Bath, Oxford.

  Fitness in TSP Problem: Distance Table

            Distance (in Miles)

              London   Oxford   Cambridge   Brighton   Bath
London           0      350        50         280      470
Oxford                    0       130         270      310
Cambridge                           0         210      340
Brighton                                        0      220
Bath                                                     0
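
One way to turn the distance table into a fitness function is sketched below. It assumes that the table is symmetric, that the salesman returns to the starting city, and that fitness is the negative tour length (so that a shorter tour has a higher fitness). These assumptions reproduce fitness values quoted for the sample population elsewhere in these slides, such as -1230 and -990, but the helper names are illustrative only.

```python
DISTANCE = {
    ("London", "Oxford"): 350, ("London", "Cambridge"): 50,
    ("London", "Brighton"): 280, ("London", "Bath"): 470,
    ("Oxford", "Cambridge"): 130, ("Oxford", "Brighton"): 270,
    ("Oxford", "Bath"): 310, ("Cambridge", "Brighton"): 210,
    ("Cambridge", "Bath"): 340, ("Brighton", "Bath"): 220,
}

def distance(a, b):
    """Look up a distance in either direction (assumes the table is symmetric)."""
    return DISTANCE[(a, b)] if (a, b) in DISTANCE else DISTANCE[(b, a)]

def tour_length(path):
    """Total miles for visiting the cities in order and returning to the start."""
    return sum(distance(path[i], path[(i + 1) % len(path)]) for i in range(len(path)))

def fitness(path):
    """Shorter tours are better, so fitness is the negative tour length."""
    return -tour_length(path)

print(fitness(["Oxford", "London", "Cambridge", "Bath", "Brighton"]))   # -1230
print(fitness(["Bath", "Oxford", "Cambridge", "London", "Brighton"]))   # -990
```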
                 Creating New Solutions
Generic crossover operation: randomly choose a position and cross over
    the contents before that position.
Crossover operation for TSP: part of the first parent is copied, and the
    rest is taken in the same order as in the second parent, so that
    every city still appears exactly once.

                   London   Oxford | Cambridge   Brighton   Bath
      Parent 1:      1        2    |     3          4         5
      Parent 2:      1        5    |     3          2         4
                            (crossover point)
Reproduction: Crossover

      Parent 1:  London, Oxford | Cambridge, Brighton, Bath      (1 2 3 4 5)
      Parent 2:  London, Brighton | Cambridge, Bath, Oxford      (1 5 3 2 4)

      Child:     London, Oxford, Brighton, Cambridge, Bath       (1 2 4 3 5)

      London and Oxford are copied from the first parent; the remaining
      cities follow in the order in which they appear in the second parent.
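
The TSP crossover for this child can be sketched as follows, working directly on city sequences. The function name `tsp_crossover` is an assumption, and the crossover point is passed in explicitly so that the result can be compared with the slide.

```python
def tsp_crossover(parent1, parent2, point):
    """Copy parent1 up to `point`, then append the missing cities in parent2's order."""
    head = parent1[:point]
    tail = [city for city in parent2 if city not in head]
    return head + tail

p1 = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]
p2 = ["London", "Brighton", "Cambridge", "Bath", "Oxford"]

child = tsp_crossover(p1, p2, point=2)
print(child)   # ['London', 'Oxford', 'Brighton', 'Cambridge', 'Bath']
```

Because the tail contains only cities that are missing from the head, the child is always a valid path, which a plain one-point crossover cannot guarantee.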
Reproduction: Crossover

      Parent 1:  London, Oxford, Cambridge, Brighton, Bath      (1 2 3 4 5)
      Parent 2:  London, Brighton, Cambridge, Bath, Oxford      (1 5 3 2 4)

      Child:     London, Cambridge, Brighton, Bath, Oxford      (1 5 2 3 4)
  Reproduction



     Exchange the cities in second and fourth place

 London  Oxford  Cambridge  Brighton  Bath

 London  Brighton  Cambridge Bath  Oxford


 London  Oxford  Cambridge  Brighton  Bath
London  Oxford  Brighton  Cambridge Bath
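Reproduction
         Mutation in the TSP Problem
     Randomly changing one gene won’t work: the result may no longer
      be a valid path.

                  London   Oxford   Cambridge   Brighton   Bath
      Original:     1        2          3          4        5
      Invalid:      1        2          4          4        5     (position 4 is used twice, position 3 never)

     Design mutations around the “swap” concept:

      Swapped:      1        2          5          4        3     (Cambridge and Bath exchange positions)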
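
A swap mutation can be sketched as follows; the helper names are assumptions for illustration. The check at the end shows why the single-gene change fails while a swap always yields a valid code.

```python
import random

def swap_mutation(code, rng):
    """Exchange the values at two distinct random positions; the result stays a permutation."""
    i, j = rng.sample(range(len(code)), 2)
    mutated = list(code)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

def is_permutation(code):
    """True if the code uses every position 1..n exactly once."""
    return sorted(code) == list(range(1, len(code) + 1))

rng = random.Random(7)
original = [1, 2, 3, 4, 5]
mutant = swap_mutation(original, rng)
print(mutant, is_permutation(mutant))
print(is_permutation([1, 2, 4, 4, 5]))   # False: the single-gene change from the slide
```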
                 GA for TSP
   For the TSP problem we have:
     Solution representation
     A fitness evaluation function

     Crossover operations on parents

     Mutation on a single solution



   Start with a population of solutions, and let
    them evolve
      Next Step: Initialize a population
Decision parameter: population size
Let us choose 5 solutions in our population.
We will now randomly initialize the population.
The number in parentheses is each solution's fitness: the negative
length of the round trip, in miles.

     London, Oxford, Cambridge, Brighton, Bath   (-1380)
     Oxford, London, Cambridge, Bath, Brighton   (-1230)
     Cambridge, Oxford, Brighton, Bath, London   (-1140)
     Bath, London, Brighton, Cambridge, Oxford   (-1400)
     Bath, Oxford, Cambridge, London, Brighton   (-990)
             Evolving Solutions for TSP
   Repeat    (Until stopping criteria has been met)
      Choose 2 solutions from the population
      With a certain “crossover probability” (say, 0.8) apply a
       crossover operator to create child1, child2
      With a certain “mutation” probability (say 0.1), mutate
       child1, child2
      Place the resulting 2 children in the population

      Selection: choose which 5 solutions survive; the probability that
       an individual survives is proportional to its fitness
   After stopping, output the best chromosome in the
    population as the solution
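
The whole loop for the five-city example might look like the sketch below. Several details are assumptions rather than part of the slides: the stopping criterion is a fixed number of rounds, survival weights are taken as 1/length (because the negative fitness values cannot be used as probabilities directly), the best tour is always kept, and all function names are illustrative.

```python
import random

CITIES = ["London", "Oxford", "Cambridge", "Brighton", "Bath"]
MILES = {
    frozenset(("London", "Oxford")): 350, frozenset(("London", "Cambridge")): 50,
    frozenset(("London", "Brighton")): 280, frozenset(("London", "Bath")): 470,
    frozenset(("Oxford", "Cambridge")): 130, frozenset(("Oxford", "Brighton")): 270,
    frozenset(("Oxford", "Bath")): 310, frozenset(("Cambridge", "Brighton")): 210,
    frozenset(("Cambridge", "Bath")): 340, frozenset(("Brighton", "Bath")): 220,
}

def tour_length(path):
    """Round-trip length in miles (assumes the salesman returns to the start)."""
    return sum(MILES[frozenset((path[i], path[(i + 1) % len(path)]))]
               for i in range(len(path)))

def merge(a, b, point):
    """Copy a up to the crossover point, then the missing cities in b's order."""
    return a[:point] + [city for city in b if city not in a[:point]]

def crossover(p1, p2, rng):
    point = rng.randrange(1, len(p1))
    return merge(p1, p2, point), merge(p2, p1, point)

def mutate(path, rng):
    """Swap two randomly chosen cities."""
    i, j = rng.sample(range(len(path)), 2)
    path = list(path)
    path[i], path[j] = path[j], path[i]
    return path

def survivors(candidates, size, rng):
    """Keep the best tour, then draw the rest with probability proportional to 1/length."""
    pool = sorted(candidates, key=tour_length)
    kept, pool = [pool[0]], pool[1:]
    while len(kept) < size:
        weights = [1 / tour_length(p) for p in pool]
        kept.append(pool.pop(rng.choices(range(len(pool)), weights=weights)[0]))
    return kept

def run_ga(pop_size=5, p_crossover=0.8, p_mutation=0.1, rounds=200, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(CITIES, len(CITIES)) for _ in range(pop_size)]
    for _ in range(rounds):
        p1, p2 = rng.sample(population, 2)
        c1, c2 = list(p1), list(p2)
        if rng.random() < p_crossover:
            c1, c2 = crossover(p1, p2, rng)
        if rng.random() < p_mutation:
            c1, c2 = mutate(c1, rng), mutate(c2, rng)
        population = survivors(population + [c1, c2], pop_size, rng)
    return min(population, key=tour_length)

best = run_ga()
print(best, -tour_length(best))
```

A GA is a heuristic, so the tour returned is not guaranteed to be the shortest one. With only five cities the result can be checked against an exhaustive search of all 12 tours.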
   Overview of the Selection Process

Over time, through the various operators,
 solutions mate and traits are passed on to the
 offspring.
Children with “better” traits have a better chance of
 surviving.
The weak solutions gradually disappear
 from the population.
Good solutions come to dominate the population.
   Building Solutions through
            Evolution


http://www.pbs.org/saf/1103/video/watchonline.htm
             GA – Advantages:
   Not engineered by hand: enables finding
    surprising solutions to problems
   Can often find good solutions to problems that
    are hard to tackle by traditional means.
   Implicit parallelism (many candidate solutions are
    explored at once) can make GAs an efficient
    optimization method.
       A great property is the ability to find
        approximate solutions to combinatorially
        explosive problems.
         GA - Disadvantages
   A heuristic: GAs may find only near-
    optimal solutions.
   Further difficulties: choosing a suitable
    representation, and making the right
    decisions about the selection method
    and the genetic operator probabilities.
    Learning Classification Rules
              With GA
A classifier can be represented as a set of rules of the form
   IF (set of conditions) THEN (Class)

   IF (Employed=No) THEN (Class=No)
   IF (Employed=Yes AND Balance<50K) THEN (Class=Yes)
   IF (Employed=Yes AND Balance>=50K AND Age<45) THEN (Class=No)
   IF (Employed=Yes AND Balance>=50K AND Age>=45) THEN (Class=Yes)

The same classifier drawn as a decision tree:

   Employed?
      No  -> No
      Yes -> Balance?
                <50K  -> Yes
                >=50K -> Age?
                            <45  -> No
                            >=45 -> Yes
    GA for Learning Classification
          Rules From Data
Representation of Rules
    All rules predict the class Yes.
    Each rule is a conjunction of conditions:
    each position is a trait (attribute) and its value.

         Marital Status   Has a Job?   Age>40
         Married          Yes          No

   In addition to all valid values, each attribute can take an empty
    condition (*), meaning that the attribute is not tested.

         Marital Status   Has a Job?   Age>40
         Married          *            No
Reproduction: Crossover

                    Marital Status   Has a Job?   Age>40
      Parent 1:     Married          Yes          No
      Parent 2:     Divorced         No           Yes

      Child 1:      Married          No           Yes
      Child 2:      Divorced         Yes          No
Reproduction: Mutation
  Random changes in attribute values

                    Marital Status   Has a Job?   Age>40
      Before:       Married          Yes          No
      After:        Divorced         No           No
                   Fitness
   Support
   Confidence
   Lift
   Support*Lift
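
These measures can be computed for a single rule as sketched below, using the standard association-rule definitions: support is the share of all examples that match the rule and are in class Yes, confidence is the share of matching examples that are in class Yes, and lift is confidence divided by the overall share of class Yes. The toy data and the helper names are hypothetical.

```python
EMPTY = "*"   # empty condition: the attribute is not tested

def applies(rule, example):
    """A rule applies when every non-empty condition equals the example's value."""
    return all(cond == EMPTY or cond == value for cond, value in zip(rule, example))

def rule_measures(rule, examples, labels):
    """Support, confidence and lift of 'IF rule THEN Yes'."""
    n = len(examples)
    covered = [label for ex, label in zip(examples, labels) if applies(rule, ex)]
    hits = covered.count("Yes")
    support = hits / n
    confidence = hits / len(covered) if covered else 0.0
    base_rate = labels.count("Yes") / n
    lift = confidence / base_rate if base_rate else 0.0
    return support, confidence, lift

# Hypothetical toy data: (Marital status, Has a job, Age>40) and the class label.
examples = [("Married", "Yes", "No"), ("Married", "No", "No"),
            ("Single", "Yes", "Yes"), ("Divorced", "Yes", "No")]
labels = ["Yes", "No", "Yes", "No"]

support, confidence, lift = rule_measures(("Married", EMPTY, "No"), examples, labels)
print(support, confidence, lift, support * lift)
```

A lift of 1.0 means the rule predicts class Yes no better than the base rate, which is why a combined measure such as Support*Lift can be more informative than support alone.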
                              Selection
   Problem with the regular selection mechanism: we want to develop a
    variety of rules, including rules that cover minority groups
   Solution: “Segmented” Elections
    Each example votes for one of the rules which apply to it. For
    example:
        Assume our population of rules (Marital status, Has_a_job, Age>40?) includes:
             Married, Yes, Yes
             Single, Empty, Yes
             Empty, Yes, Empty
             Divorced, Empty, No
        The example (Single, Yes, Yes) can vote for either:
             Single, Empty, Yes
             Empty, Yes, Empty
        The probability of voting for either rule is proportional to the fitness of
         the rule
Selection: The survival of the fittest

   Each rule gets a score: the proportion of the examples
    it applies to that voted for it
       Example:
            Single, Empty, Yes – 30%
            Empty, Yes, Empty – 70%


   Each rule competes only with rules that apply to the
    same set of examples.
   This allows a form of niching. Rules applying to small
    subsets of examples can survive.
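
The election can be sketched as follows. The rule fitness values and the repeated example are hypothetical, chosen so that the two applicable rules end up with scores of roughly 30% and 70%; the function names are illustrative.

```python
import random

EMPTY = "Empty"

def applies(rule, example):
    """A rule applies when every non-empty condition equals the example's value."""
    return all(cond == EMPTY or cond == value for cond, value in zip(rule, example))

def election_scores(rules, fitnesses, examples, rng):
    """Each example votes for one applicable rule, with probability proportional to
    rule fitness. A rule's score is its votes divided by the examples it applies to."""
    votes = [0] * len(rules)
    coverage = [0] * len(rules)
    for example in examples:
        eligible = [i for i, rule in enumerate(rules) if applies(rule, example)]
        for i in eligible:
            coverage[i] += 1
        if eligible:
            winner = rng.choices(eligible, weights=[fitnesses[i] for i in eligible])[0]
            votes[winner] += 1
    return [v / c if c else 0.0 for v, c in zip(votes, coverage)]

rules = [("Married", "Yes", "Yes"), ("Single", EMPTY, "Yes"),
         (EMPTY, "Yes", EMPTY), ("Divorced", EMPTY, "No")]
fitnesses = [1.0, 3.0, 7.0, 1.0]              # hypothetical fitness values
examples = [("Single", "Yes", "Yes")] * 1000  # hypothetical data: one example type, repeated

scores = election_scores(rules, fitnesses, examples, random.Random(0))
print(scores)
```

Because a rule is scored only against the examples it applies to, a rule covering a small group is not penalized for ignoring the rest of the data.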
Another rule representation: Tree

   And
      And
         >  (Age, 45)
         >  (Balance, 50)
      =  (Employed, Yes)
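
A tree-shaped rule can be stored as nested tuples and evaluated recursively. This is a sketch; the tuple layout `(operator, left, right)` and the names `RULE`, `OPS` and `evaluate` are assumptions made for illustration.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# The tree from the slide as nested tuples: (operator, left, right).
RULE = ("And",
        ("And", (">", "Age", 45), (">", "Balance", 50)),
        ("=", "Employed", "Yes"))

def evaluate(node, record):
    """Recursively evaluate a rule tree against one record (a dict of attribute values)."""
    op, left, right = node
    if op == "And":
        return evaluate(left, record) and evaluate(right, record)
    return OPS[op](record[left], right)   # leaf comparison: attribute vs. constant

print(evaluate(RULE, {"Age": 52, "Balance": 80, "Employed": "Yes"}))   # True
print(evaluate(RULE, {"Age": 30, "Balance": 80, "Employed": "Yes"}))   # False
```

With this layout, crossover and mutation become operations on subtrees: replacing one nested tuple with another always yields a syntactically valid rule.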
                 Crossover Operations

   Tree 1:                              Tree 2:
   And                                  And
      And                                  >  (Age, 45)
         >  (Age, 45)                      =  (Employed, No)
         >  (Balance, 50)
      =  (Employed, Yes)
Mutation: Randomly changing a subtree

   And
      And
         >  (Age, 45)        changes to    <  (Age, 45)
         >  (Balance, 50)
      =  (Employed, Yes)
Learn Trading Rules from data

Individual:
  A “buy” / “sell” rule.
  Specifying conditions for sell or buy

Representation: tree
          Example for a “buy” rule

      >  (High at T-2, High at T-2)
      >  (Low at T-4, Low at T-3)
              Fitness Function
   Return over a period of time (Jonsson et al.)
       Adjusted to a benchmark strategy
Crossover

   Tree 1:                                  Tree 2:
   and                                      and
      >  (High at T-1, High at T-2)            >  (High at T-1, High at T-2)
      >  (Low at T-4, Low at T-3)              <  (Low at T-4, Low at T-1)
Mutation: Add randomly to the tree

   and
      and
         >  (High at T-1, High at T-2)
         >  (Low at T-4, Low at T-3)
      <  (Low at T-4, Low at T-1)
            Maintaining diversity
   Problem with the regular selection mechanism: we want to develop a
    variety of rules which cover different scenarios,
    particularly atypical ones
   Solution: Niching
        The fitness function is adjusted to accommodate diversity
        Example: returns divided by the number of periods in which the
         conditions of the rule apply
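
One reading of this adjustment is sketched below: the return earned while a rule's conditions hold is divided by the number of periods in which they hold, so a rule that applies rarely is not penalized for being rare. The interpretation, the function name `adjusted_fitness`, and the numbers are all assumptions for illustration.

```python
def adjusted_fitness(returns, rule_applies):
    """Total return earned while the rule applies, divided by the number of such periods."""
    active = [r for r, on in zip(returns, rule_applies) if on]
    return sum(active) / len(active) if active else 0.0

# Hypothetical per-period returns (in percent) and rule signals.
returns = [1.0, -0.5, 2.0, 0.5, 3.0, -1.0]
common_rule = [True, True, True, True, False, True]    # applies in 5 periods
rare_rule = [False, False, False, False, True, False]  # applies in 1 period

print(adjusted_fitness(returns, common_rule))   # 0.4
print(adjusted_fitness(returns, rare_rule))     # 3.0
```

On total return alone the two rules would score 2.0 and 3.0; after the adjustment the rare rule's advantage per period in which it applies becomes much clearer.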
Performance of GA for Trading
           Rules
   Some studies reported failure (Allen F.,
    Karjalainen R., 1993), while others reported
    success over benchmarks such as buy-and-hold
    strategies.
   Main argument: the Efficient Market
    Hypothesis and non-predictability
   Technical rules have been shown to
    outperform
            GA as a Methodology
   GA is a methodology, not a solution
   Provides tools for engineering a system to
    generate solutions
     Need to formulate the right questions
     Good engineering
          Risk of overfitting
         Design a good fitness function

         Incorporate important factors for rules to capture
Topics for Final Exam
   The Data Mining Tasks (regression, classification, etc.)
   Predictive Vs. Prescriptive
   Data-driven vs. Model/Theory-driven
   DBMS, Data Warehouse & OLAP
   Modeling: Problem formulation, Model Building, Evaluation
   Classification Decision Trees
   Association rules (representation, understand measures)
   K-Nearest Neighbor Classification
   Personalization with Collaborative Filtering
   Clustering (K-means clustering)
   Text Mining
        Representation of documents in tabular format for data mining tasks (association
         rules, classification)
        Information Retrieval
             Measures (precision, recall)
             Possible applications
   Genetic Algorithms:
        The principles of GA
        How to design solutions to problems with GA

				
posted:12/3/2011
language:English
pages:57