Machine Learning
11/16
Learning
 --Improving the performance of the agent w.r.t. the external performance measure




Dimensions:
 What can be learned?
  --Any of the boxes representing the agent's knowledge
  --action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor-data interpretation models
 What feedback is available?
  --Supervised, unsupervised, "reinforcement" learning
    --Credit assignment problem
 What prior knowledge is available?
  --"Tabula rasa" (agent's head is a blank slate) or pre-existing knowledge
Dimensions of Learning
•  "Representation" of the knowledge
•  Degree of Guidance
    – Supervised
       • Teacher provides training examples & solutions
         – E.g. classification
    – Unsupervised
       • No assistance from teacher
         – E.g. clustering; inducing hidden variables
    – In-between
       • Either feedback is given only for some of the examples
         – Semi-supervised learning
       • Or feedback is provided after a sequence of decisions is made
         – Reinforcement learning
•  Degree of Background Knowledge
    – Tabula rasa
       • No background knowledge other than the training examples
    – Knowledge-based learning
       • Examples are interpreted in the context of existing knowledge
•  Inductive vs. deductive learning
    – If you do have background knowledge, then a question is whether the learned knowledge is "entailed" by the background knowledge or not
       – (Entailment can be logical or probabilistic)
       • If it is entailed, then it is called "deductive" learning
       • If it is not entailed, then it is called inductive learning
Inductive Learning
(Classification Learning)
•  Given a set of labeled training examples
    – Find the rule that underlies the labeling
       • (so you can use it to predict future unlabeled examples)
    – Tabula rasa, fully supervised
•  Questions:
    – How do we test a learner?
    – Can learning ever work?
    – How do we compare learners?

 --Similar to predicting credit card fraud, predicting who is likely to respond to junk mail, predicting what items you are likely to buy
 --Closely related to function learning or curve-fitting (regression)
           K-Nearest Neighbor
• An unseen instance’s class is determined by its
  nearest neighbor
   – Or the majority label of its nearest k neighbors
• Real issue: getting the right distance metric to
  decide who the neighbors are…
• One of the most obvious classification algorithms
   – Skips the middle stage and lets the examples be their
     own pattern
      • A variation is to “cluster” the training examples and remember
        the prototypes for each cluster (reduces the number of things
        remembered)
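A minimal sketch of the idea (assuming numeric feature vectors and Euclidean distance; as noted above, choosing the right distance metric is the real issue):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify `query` by the majority label of its k nearest training examples.
    `examples` is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# toy usage: two clusters in 2-D
train = [((0, 0), '-'), ((0, 1), '-'), ((5, 5), '+'), ((6, 5), '+')]
print(knn_classify((5, 6), train, k=3))   # -> '+'
```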
              Inductive Learning
           (Classification Learning)
•   How are learners tested?
     – Performance on the test data (not
       the training data)
     – Performance measured in terms of
       false positives and false negatives
•   (when) Can learning work?
     – Training and test examples the
       same?
     – Training and test examples have
       no connection?
     – Training and Test examples from
       the same distribution
A good hypothesis will have fewest false positives (Fh+) and fewest false negatives (Fh-)
[Ideally, we want them to be zero]
  False +ve: the learner classifies the example as +ve, but it is actually -ve

Rank(h) = f(Fh+, Fh-)
  --f depends on the domain; by default f = sum, but we can give different weights to different errors (cost-based learning)

Medical domain
  --Higher cost for F-
  --But also high cost for F+
Spam mailer
  --Very low cost for F+
  --Higher cost for F-
Terrorist/criminal identification
  --High cost for F+ (for the individual)
  --High cost for F- (for society)

Ranking hypotheses
 H1: Russell waits only in Italian restaurants
   false +ves: X10
   false -ves: X1, X3, X4, X8, X12
 H2: Russell waits only in cheap French restaurants
   false +ves:
   false -ves: X1, X3, X4, X6, X8, X12
What is a reasonable goal in designing a learner?
• (Idea) Learner must classify all new instances (test cases) correctly, always
• Always?
   – Maybe the training samples are not completely representative of the test samples
   – So, we go with "probably"
• Correctly?
   – May be impossible if the training data has noise (the teacher may make mistakes too)
   – So, we go with "approximately"
• The goal of a learner then is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) e and probability d.
• When is a learner A better than learner B?
   – For the same e, d bounds, A needs fewer training samples than B to reach PAC.

Learning curves: complexity measured in the number of samples required to PAC-learn
Inductive Learning
(Classification Learning)
•  Given a set of labeled examples, and a space of hypotheses
    – Find the rule that underlies the labeling
       • (so you can use it to predict future unlabeled examples)
    – Tabula rasa, fully supervised
•  Idea:
    – Loop through all hypotheses
       • Rank each hypothesis in terms of its match to the data
       • Pick the best hypothesis
•  Main variations:
•  Bias: what "sort" of rule are you looking for?
    – If you are looking for only conjunctive hypotheses, there are just 3^n
•  Search:
    – Greedy search
       – Decision tree learner
    – Systematic search
       – Version space learner
    – Iterative search
       – Neural net learner

The main problem is that the space of hypotheses is too large
  Given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses
  For 6 features, there are 18,446,744,073,709,551,616 hypotheses

It can be shown that the sample complexity of PAC learning is proportional to 1/e, log(1/d), AND log |H|
11/21
Thanksgiving and…



Suppose you randomly reshuffled the world, and you have 100 people on your
   street (randomly sampled from the entire world).
• On your street, there will be 5 people from the US. Suppose they are a family.
   This family:
     – Will own 2 of the 8 cars on the entire street
     – Will own 60% of the wealth of the whole street
     – Of the 100 people on the street, you (and you alone) will have had a college
       education
•   …and of your neighbors
     – Nearly half (50) of your neighbors would suffer from malnutrition.
     – About 13 of the people would be chronically hungry.
     – One in 12 of the children on your street would die of some mostly preventable
       disease by the age of 5: from measles, malaria, or diarrhea. One in 12.

“If we came face to face with these inequities every day, I believe we would
    already be doing something more about them.”
                           --William H. Gates


                                               http://www.pbs.org/now/transcript/transcript_gates.html
                Administrative
• Homework 4 is gathering mass…
   – (insert ominous sound effects here)
• Project 4 will be the last coding project
   – You can submit until tomorrow (make-up class)
     without penalty; and with 3p% penalty until Monday
• Make-up class tomorrow BYENG 210
   – Note: This is the second floor of Brickyard bldg (near
     the instructional labs; and on the same floor as advising
     office)
We defined the inductive learning problem…
• So, let's get started learning already…

The more expressive the bias, the larger the hypothesis space, and the slower the learning
  --Line fitting is faster than curve fitting
  --Line fitting may miss non-line patterns



“Gavagai” example.
  -The “whole object” bias in
     language learning.
Brazilians…
Uses different biases in predicting Russell's waiting habits
(Figure: the same training examples feed each of these learners)

 Decision Trees
  --Examples are used to
    --Learn topology
    --Order of questions

 K-nearest neighbors

 Association rules
  --Examples are used to
    --Learn support and confidence of association rules
  e.g.  If patrons=full and day=Friday then wait (0.3/0.7)
        If wait>60 and Reservation=no then wait (0.4/0.9)

 SVMs

 Neural Nets
  --Examples are used to
    --Learn topology
    --Learn edge weights

 Naïve bayes (bayesnet learning)
  --Examples are used to
    --Learn topology
    --Learn CPTs
  e.g.  "Russell waits" as the class node, with attributes Wait time?, Patrons?, Friday?
        P(Patrons | RW):   RW=T: none 0.3, some 0.2, full 0.5;   RW=F: none 0.4, some 0.3, full 0.3
The Many Splendors of Bias

(Flow: training examples (labelled) → "Bias" filter → the space of hypotheses → pick the best hypothesis that fits the examples → use the hypothesis to predict new instances)

Bias is any knowledge other than the training examples that is used to restrict the space of hypotheses considered
Can be domain-independent or domain-specific
Biases
• Domain-independent bias
   – Syntactic bias
      • Look for "lines"
      • Look for naïve bayes nets
      • "Whole object" bias
         – Gavagai problem
   – Preference bias
      • Look for "small" decision trees
• Domain-specific bias
   – ALL domain knowledge is bias!
      • Background theories & explanations
         – The relevant features of the data point are those that take part in explaining why the data point has that label
      • Weak domain theories/determinations
         – Nationality determines language
         – Color of the skin determines degree of sunburn
      • Relevant features
         – I know that certain phrases are relevant for spam/non-spam classification
Bias & Learning cost
• Strong bias ⇒ smaller filtered hypothesis space
   – Lower learning cost! (because you need fewer examples to rank the hypotheses!)
      • Suppose I have decided that hair length determines pass/fail grade in the class; then I can "learn" the concept with a _single_ example!
   – Cuts down the concepts you can learn accurately
• Strong bias ⇒ fewer parameters for describing the hypothesis
   – Lower learning cost!!
   – For discrete-variable learning cases, the sample complexity can be shown to be proportional to log(|H|), where H is the hypothesis space

                                                      Note: This result only holds for finite hypothesis spaces
                                                      (e.g. not valid for the space of line hypotheses!)

PAC learning
A learner is considered PAC (probably approximately correct) with respect to an error rate e and a probability d (0 < e, d < 1) if
   Pr(the learner makes more than an e fraction of errors) ≤ d

It can be shown that in the worst case, a learner needs N training samples to PAC-learn a concept, where N is related to e, d and the size of the hypothesis space |H| as follows:

   N ≥ (1/e) ( ln(1/d) + ln |H| )

(Figure: the hypothesis space H, the target function f, and the region Hbad of hypotheses whose error exceeds e)
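As a quick check on these numbers (a sketch; it just evaluates the bound above, with natural logarithms):

```python
import math

def pac_sample_bound(epsilon, delta, hypothesis_space_size):
    """Worst-case number of training samples sufficient to PAC-learn:
    N >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return math.ceil((1 / epsilon) *
                     (math.log(1 / delta) + math.log(hypothesis_space_size)))

n = 6                                # boolean features
H_all = 2 ** (2 ** n)                # all boolean functions: 18,446,744,073,709,551,616
H_conj = 3 ** n                      # only conjunctive hypotheses: 729
print(pac_sample_bound(0.1, 0.05, H_all))    # -> 474
print(pac_sample_bound(0.1, 0.05, H_conj))   # -> 96
```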
Bias & Learning Accuracy
• Having weak bias (large hypothesis space)
   – Allows us to capture more concepts
   – ..increases learning cost
   – May lead to over-fitting

(Figure: fraction incorrectly classified vs. amount of training; the test (prediction) error stays above the training error)
Tastes Great/Less Filling
• Biases are essential for the survival of an agent!
   – You need biases just to make learning tractable
      • "Whole object" bias used by kids in language acquisition
• Biases put blinders on the learner, filtering away (possibly more accurate) hypotheses
   – "God doesn't play dice with the universe" (Einstein)
   – "Color of skin relevant to predicting crime" (Bill Bennett, former Education Secretary)
Mirror, mirror, on the wall,
 which learning bias is the best of all?

Well, there is no such thing, silly!
 --Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:
    -A line-fitter can fit the best line to the data very fast, but won't know what to do if the data doesn't fall on a line
    -A curve-fitter can fit lines as well as curves… but takes longer to fit lines than a line-fitter
 --Different types of bias classes (decision trees, NNs etc.) provide different ways of naturally carving up the space of all possible hypotheses
    --Decision trees can capture all boolean functions, but are faster at capturing conjunctive boolean functions
    --Neural nets can capture all boolean or real-valued functions, but are faster at capturing linearly separable functions
    --Bayesian learning can capture all probabilistic dependencies, but is faster at capturing single-level dependencies (naïve bayes classifier)

So a more reasonable question is:
 --What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
 --In this bias class, what is the most restrictive bias that still can capture the true pattern in the data?

Bias can be seen as a sneaky way of letting background knowledge in..

(Same figure as before: the training examples feed decision trees, association rules, K-nearest neighbors, SVMs, neural nets, and naïve bayes learners)
          A classification learning example
     Predicting when Russell will wait for a table




--similar to book preferences, predicting credit card fraud,
  predicting when people are likely to respond to junk mail
Learning Decision Trees---How?

                        Basic Idea:
                         --Pick an attribute
                         --Split examples in terms
                               of that attribute
                            --If all examples are +ve
                               label Yes. Terminate
                           --If all examples are –ve
                               label No. Terminate
                           --If some are +ve, some are –ve
                              continue splitting recursively




                Which one to pick?
        Decision Trees & Sample
              Complexity
• Decision trees can represent any boolean
  function
• ..So PAC-learning decision trees should be
  exponentially hard (since there are 2^(2^n)
  hypotheses)
• ..however, decision tree learning algorithms use
  greedy approaches for learning a good (rather than
  the optimal) decision tree
   – Thus, using greedy rather than exhaustive search of
     hypotheses space is another way of keeping complexity
     low (at the expense of losing PAC guarantees)
Depending on the order we pick, we can get smaller or bigger trees




      Which tree is better?
       Why do you think so??
Basic Idea:
 --Pick an attribute
 --Split examples in terms
       of that attribute
    --If all examples are +ve
       label Yes. Terminate
   --If all examples are –ve
       label No. Terminate
   --If some are +ve, some are –ve
      continue splitting recursively
        --if no attributes left to split?
          (label with majority element)
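A minimal sketch of this recursive procedure (assuming examples come as attribute→value dictionaries with a label; which attribute to pick is deliberately left as a plug-in, and the information-gain heuristic on the following slides is the usual choice):

```python
from collections import Counter

def learn_tree(examples, attributes, choose=None):
    """Recursive decision-tree learner (sketch).
    examples:   list of (attribute->value dict, label) pairs.
    attributes: list of attribute names still available to split on.
    choose:     function picking the attribute to split on; the usual choice
                is the one with the highest information gain (next slides)."""
    choose = choose or (lambda exs, attrs: attrs[0])    # placeholder choice
    labels = [lab for _, lab in examples]
    if len(set(labels)) == 1:                           # all +ve or all -ve: terminate
        return labels[0]
    if not attributes:                                  # no attributes left to split
        return Counter(labels).most_common(1)[0][0]     # label with majority element
    attr = choose(examples, attributes)
    subtree = {}
    for value in {ex[attr] for ex, _ in examples}:      # split on that attribute
        subset = [(ex, lab) for ex, lab in examples if ex[attr] == value]
        rest = [a for a in attributes if a != attr]
        subtree[value] = learn_tree(subset, rest, choose)
    return (attr, subtree)
```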




           Would you split on
            patrons or Type?
The Information Gain Computation

  P+ : N+ / (N+ + N-)
  P- : N- / (N+ + N-)
  I(P+, P-) = -P+ log(P+) - P- log(P-)
     (# expected comparisons needed to tell whether a given example is +ve or -ve)

Splitting on feature fk divides the N+ positive and N- negative examples among k branches with counts N1+/N1-, N2+/N2-, …, Nk+/Nk-. The residual information after the split is

  Sum_{i=1..k}  [Ni+ + Ni-] / [N+ + N-] * I(Pi+, Pi-)

The difference is the information gain. So, pick the feature with the largest information gain, i.e. the smallest residual information.

Given k mutually exclusive and exhaustive events E1….Ek whose probabilities are p1….pk, the "information" content (entropy) is defined as

  Sum_i  -pi log2 pi

A split is good if it reduces the entropy..
11/22 (Make-up for 11/28)
A simple example

Ex   Masochistic   Anxious   Nerdy   HATES EXAM
1    F             T         F       Y
2    F             F         T       N
3    T             F         F       N
4    T             T         T       Y

V(M) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1
V(A) = 2/4 * I(1, 0)     + 2/4 * I(0, 1)     = 0
V(N) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1

So Anxious is the best attribute to split on
Once you split on Anxious, the problem is solved
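A sketch that reproduces these numbers (the residual information after splitting on each attribute, with base-2 logs and 0·log 0 taken as 0):

```python
import math

def entropy(p_pos, p_neg):
    """I(P+, P-) = -P+ log2 P+ - P- log2 P-."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

def residual_info(split):
    """split: list of (n_pos, n_neg) counts, one per branch.
    Returns Sum_i [Ni+ + Ni-]/[N+ + N-] * I(Pi+, Pi-)."""
    total = sum(p + n for p, n in split)
    return sum((p + n) / total * entropy(p / (p + n), n / (p + n))
               for p, n in split)

# Splitting the 4 examples above on each attribute ((#Y, #N) counts per branch):
print(residual_info([(1, 1), (1, 1)]))   # Masochistic -> 1.0
print(residual_info([(2, 0), (0, 2)]))   # Anxious     -> 0.0
print(residual_info([(1, 1), (1, 1)]))   # Nerdy       -> 1.0
```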
Evaluating the Decision Trees
Lesson: every bias makes some things easier to learn and others harder to learn…

(Figure: learning curves for the decision-tree learner on the Russell domain and on the "majority" function (say yes if a majority of the attributes are yes))
Learning curves… Given N examples, partition them into Ntr (the training set) and Ntest (the test instances)
 Loop for i = 1 to |Ntr|
   Loop for Ns in subsets of Ntr of size i
     Train the learner over Ns
     Test the learned pattern over Ntest and compute the accuracy (% correct)
 Can also consider different Ntr and Ntest partitions
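A sketch of that loop (assuming a train(examples) function that returns a classifier and an accuracy(h, examples) scorer, both supplied by the caller; random subsets of each size stand in for enumerating all subsets):

```python
import random

def learning_curve(train_set, test_set, train, accuracy, trials=10):
    """For each training-set size i, train on random subsets of size i and
    record the mean accuracy on the held-out test set."""
    curve = []
    for i in range(1, len(train_set) + 1):
        scores = []
        for _ in range(trials):                    # sample subsets instead of enumerating all
            subset = random.sample(train_set, i)
            h = train(subset)
            scores.append(accuracy(h, test_set))
        curve.append((i, sum(scores) / len(scores)))
    return curve
```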
                                                            This was used in the class;
                                                            but the next one is the correct
                                                            replacement
Decision Stumps
• Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all -ve training examples
• In general, with each leaf node, we can associate a probability p+ that if we reach that leaf node, the example is classified +ve
    • When you reach that node, you toss a biased coin (whose probability of heads is p+) and output +ve if the coin comes up heads
  – In normal decision trees, p+ is 0 or 1
  – In decision stumps, 0 <= p+ <= 1

(Figure: splitting on feature fk; for the first branch, P+ = N1+ / (N1+ + N1-))

Majority vote is better than tossing the coin…

  Sometimes, the best decision tree for a problem
  could be a decision stump (see coin toss example next)
Decision Stumps
• Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all -ve training examples
   – Could happen either because examples are noisy and mis-classified or because you want to stop before reaching pure leaves
• When you reach that node, you return the majority label as the decision.
   • (We can associate a confidence with that decision using the P+ and P-)

(Figure: splitting on feature fk; for the first branch, P+ = N1+ / (N1+ + N1-))

  Sometimes, the best decision tree for a problem
  could be a decision stump (see coin toss example next)
Problems with the Info. Gain Heuristic

 • Feature correlation: We are splitting on one feature at a time
          • The Costanza party problem
     – No obvious easy solution…
 • Overfitting: We may look too hard for patterns where there are none
    – E.g. Coin tosses classified by the day of the week, the shirt I was
      wearing, the time of the day etc.
    – Solution: Don’t consider splitting if the information gain given by
      the best feature is below a minimum threshold
          • Can use the χ2 test for statistical significance
   – Will also help when we have noisy samples…
 • We may prefer features with very high branching
     – e.g. Branch on the “universal time string” for Russell restaurant example
     –      Branch on social security number to look for patterns on who will get A
     – Solution: “gain ratio” --ratio of information gain with the attribute A to the
       information content of answering the question “What is the value of A?”
          • The denominator is smaller for attributes with smaller domains.
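A sketch of the gain-ratio correction (standalone; `split` is assumed to be the list of (+ve, -ve) counts, one per value of attribute A):

```python
import math

def info(probs):
    """Entropy: Sum_i -p_i log2 p_i."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gain_ratio(split):
    """Information gain of splitting on A, divided by the information content
    of answering "What is the value of A?" (smaller for small domains)."""
    total = sum(p + n for p, n in split)
    pos = sum(p for p, _ in split)
    prior = info([pos / total, (total - pos) / total])
    residual = sum((p + n) / total * info([p / (p + n), n / (p + n)])
                   for p, n in split)
    gain = prior - residual
    split_info = info([(p + n) / total for p, n in split])
    return gain / split_info if split_info > 0 else 0.0
```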
         Neural Network Learning
• Idea: Since classification is
  really a question of finding
  a surface to separate the
  +ve examples from the -ve
  examples, why not directly
  search in the space of
  possible surfaces?
• Mathematically, a surface
  is a function
    – Need a way of learning
      functions
    – “Threshold units”
"Neural Net" is a collection of threshold units with interconnections

(Figure: a threshold unit with inputs I1, I2, weights w1, w2 and threshold t = k;
 output = 1 if w1*I1 + w2*I2 > k, and 0 otherwise)
Feed forward (uni-directional connections)
  – Single layer: any linear decision surface can be represented by a single-layer neural net
  – Multi-layer: any "continuous" decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net
Recurrent (bi-directional connections)
  – Can act as associative memory
The "Brain" Connection
(Figure: a threshold unit …is sort of like a neuron; threshold functions, including a differentiable (sigmoid) version)
Perceptron Networks

What happened to the "Threshold"?
 --Can model it as an extra weight with a static input:
   a unit with inputs I1, I2, weights w1, w2 and threshold t = k
   is equivalent to a unit with an extra input I0 = -1, weight w0 = k, and threshold t = 0
Can Perceptrons Learn All Boolean Functions?
       --Are all boolean functions linearly separable?
Perceptron Learning as Gradient Descent Search in the Space of Weights

   E = 1/2 * Sum_i (T - O)^2

   E(W) = 1/2 * Sum_i ( T - g( Sum_j Wj Ij ) )^2

   dE/dWj = - Ij (T - O) g'( Sum_j Wj Ij )

   Wj ← Wj + a * Ij (T - O) g'( Sum_j Wj Ij )

 where g(x) = 1 / (1 + e^-x)  (the sigmoid fn) and g'(x) = g(x) (1 - g(x))

 Often a constant learning rate parameter is used instead.
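A minimal sketch of this update for a single sigmoid unit (assuming a constant learning rate and examples given as (inputs, target) pairs; the fixed input I0 = -1 plays the role of the threshold, as on the earlier slide):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sigmoid_unit(examples, n_inputs, alpha=0.1, epochs=1000):
    """Gradient descent on E = 1/2 * sum (T - O)^2 for one sigmoid unit.
    examples: list of (inputs, target) pairs with targets in {0, 1}."""
    w = [0.0] * (n_inputs + 1)                  # w[0] is the threshold weight (input fixed to -1)
    for _ in range(epochs):
        for inputs, target in examples:
            x = [-1.0] + list(inputs)
            net = sum(wi * xi for wi, xi in zip(w, x))
            out = sigmoid(net)
            grad = (target - out) * out * (1 - out)        # (T - O) g'(net)
            w = [wi + alpha * grad * xi for wi, xi in zip(w, x)]
    return w

# e.g. learning the (linearly separable) OR function
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_sigmoid_unit(data, 2)
print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, [-1.0] + list(i)))), 2)
       for i, _ in data])
```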
                   Comparing Perceptrons and Decision Trees
                    in Majority Function and Russell Domain




              Majority function                                          Russell Domain




Majority function is linearly separable..         Russell domain is apparently not....

                                            Encoding: one input unit per attribute. The unit takes as many
                                            distinct real values as the size of attribute domain
    This slide was shown in the class; but the next one is a better replacement…


         Max-Margin Classification &
          Support Vector Machines
•   Any line that separates the +ve &
    –ve examples is a solution
•   And perceptron learning finds one
    of them
     – But could we have a preference
       among these?
     – We may want the line that
       provides the maximum margin
       (equidistant from the nearest +ve/-ve examples)
          • The nearest +ve and –ve examples holding
            up the line are called support
            vectors
•   This changes the problem into an                     Support vectors
    optimization one
     – Quadratic Programming can be
       used to directly find such a line



                        Learning is Optimization after all!
Lagrangian Dual
Linear Separability in High Dimensions




 “Kernels” allow us to consider separating surfaces in high-D
   without first converting all points to high-D
          11/30

Next class: Conclusion +
   Interactive review
Today: Statistical Learning
Kernelized Support Vector Machines
•   Turns out that it is not always
    necessary to first map the data into
    high-D, and then do linear separation
•   The quadratic programming
    formulation for SVM winds up using
    only the pair-wise dot product of
    training vectors
•   Dot product is a form of similarity
    metric between points
•   If you replace that dot product by
    any non-linear function, you will, in
    essence, be transforming the data into
    some high-dimensional space and
    then finding the max-margin linear
    classifier in that space
     –   Which will correspond to some
         wiggly surface in the original
         dimension
•   The trick is to find the RIGHT
    similarity function
     –   Which is a form of prior knowledge
(Figure on the same slide: the classifier found with a polynomial kernel, e.g. K(A, A') = ((A/100 - 1)(A'/100 - 1) - 0.5)^6, which corresponds to a wiggly surface in the original dimension)
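For illustration only, a sketch using scikit-learn (an outside library, not part of these notes): the same max-margin machinery run with a linear kernel and with a polynomial kernel, on four points that no single line can separate.

```python
# Sketch (assumes scikit-learn is installed; the SVC class is its SVM implementation).
from sklearn import svm

# Four 2-D points with XOR-style labels: not linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear = svm.SVC(kernel='linear', C=10).fit(X, y)
poly = svm.SVC(kernel='poly', degree=2, coef0=1, C=10).fit(X, y)

print(linear.predict(X))   # no line gets all four right
print(poly.predict(X))     # a quadratic separating surface can
```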
                    Those who ignore easily available domain knowledge are
                    doomed to re-learn it…
                        Santayana’s brother
 Domain-knowledge & Learning
• Classification learning is a problem addressed by both people
  from AI (machine learning) and Statistics
• Statistics folks tend to "distrust" domain-specific bias.
   – Let the data speak for itself…
   – ..but this is often futile. The very act of "describing" the data points
     introduces bias (in terms of the features you decided to use to
     describe them..)
• …but much human learning occurs because of strong domain-
  specific bias..
• Machine learning is torn by these competing influences..
   – In most current state-of-the-art algorithms, domain knowledge is
     allowed to influence learning only through relatively narrow
     avenues/formats (e.g. through "kernels")
       • Okay in domains where there is very little (if any) prior knowledge (e.g.
         what part of proteins are doing what cellular function)
       • ..restrictive in domains where there already exists human expertise..
Multi-layer Neural Nets




                          How come back-prop
                          doesn’t get stuck in
                           local minima?

One answer: It is actually
 hard for local minima to
 form in high-D, as the "trough"
 has to be closed in all dimensions
Multi-layer Network Learning can learn the Russell Domain




                 Russell Domain



       …but does it slowly…
     Practical Issues in Multi-layer
            network learning
• For multi-layer networks, we need to learn both
  the weights and the network topology
   – Topology is fixed for perceptrons
• If we go with too many layers and connections, we
  can get over-fitting as well as sloooow
  convergence
   – Optimal brain damage
      • Start with more than needed hidden layers as well as
        connections; after a network is learned, remove the nodes and
        connections that have very low weights; retrain
Humans make 0.2% errors; Neumans (postmen) make 2%

Other impressive applications:
 --no-hands across America
 --learning to speak

K-nearest-neighbor
 The test example's class is determined by the class of the majority of its k nearest neighbors
 Need to define an appropriate distance measure
  --sort of easy for real-valued vectors
  --harder for categorical attributes
True hypothesis eventually dominates…
 probability of indefinitely producing uncharacteristic data → 0
Bayesian prediction is optimal
 (Given the hypothesis prior,
   all other predictions are less likely)
Also, remember the Economist article that shows
 that humans have strong priors..
..note that the Economist article says humans are
 able to learn from few examples only because of priors..
So, BN learning is just probability estimation!
  (as long as data is complete!)
        How Well (and WHY) DOES NBC
                    WORK?
• Naïve bayes classifier is darned easy to implement
    – Good learning speed, classification speed
    – Modest space storage
    – Supports incrementality
• It seems to work very well in many scenarios
    – Lots of recommender systems (e.g. Amazon books recommender) use it
    – Peter Norvig, the director of Machine Learning at Google, when asked
      what sort of technology they use, said "Naïve bayes"
• But WHY?
    – NBC's estimate of class probability is quite bad
        •   BUT classification accuracy is different from probability estimate accuracy
    – [Domingos/Pazzani, 1996] analyze this
Bayes Network Learning
• Bias: The relation between the class label and class
  attributes is specified by a Bayes network.
• Approach
   – Guess topology
   – Estimate CPTs
• Simplest case: Naïve Bayes
   – Topology of the network is "class label" causes all the attribute
     values independently
   – So, all we need to do is estimate the CPTs P(attrib|Class)
       • In the Russell domain, P(Patrons|willwait)
            – P(Patrons=full|willwait=yes) =
              #training examples where patrons=full and willwait=yes
              / #training examples where willwait=yes
   – Given a new case, we use Bayes rule to compute the class label

(Figure: the naïve Bayes network for the Russell domain: "Russell waits" is the class node with attribute children Wait time?, Patrons?, Friday?; e.g. P(Patrons | RW): RW=T: none 0.3, some 0.2, full 0.5; RW=F: none 0.4, some 0.3, full 0.3)

           Class label is the disease; attributes are symptoms
  Naïve Bayesian Classification
• Problem: Classify a given example E into one of the
  classes among [C1, C2 ,…, Cn]
   – E has k attributes A1, A2 ,…, Ak and each Ai can take d
     different values
• Bayes Classification: Assign E to class Ci that
  maximizes P(Ci | E)
     P(Ci| E) = P(E| Ci) P(Ci) / P(E)
      • P(Ci) and P(E) are a priori knowledge (or can be easily extracted
        from the set of data)
• Estimating P(E|Ci) is harder
   – Requires P(A1=v1 A2=v2….Ak=vk|Ci)
      • Assuming d values per attribute, we will need n*d^k probabilities
• Naïve Bayes Assumption: Assume all attributes are
  independent:  P(E| Ci) = Prod_j P(Aj=vj | Ci)
   – The assumption is BOGUS, but it seems to WORK (and
     needs only n*d*k probabilities)
NBC in terms of Bayes networks..

(Figure: the NBC assumption (the class is the sole parent of every attribute) vs. a more realistic assumption with dependencies among the attributes)
Estimating the probabilities for NBC
Given an example E described as A1=v1 A2=v2….Ak=vk, we want to
   compute the class of E

    – Calculate P(Ci | A1=v1 A2=v2….Ak=vk) for all classes Ci and say
      that the class of E is the one for which P(.) is maximum
    – P(Ci | A1=v1 A2=v2….Ak=vk)
            = Prod_j P(vj | Ci ) * P(Ci) / P(A1=v1 A2=v2….Ak=vk)
      (the denominator is a common factor across all classes)

Given a set of N training examples that have already been classified into n
   classes Ci
      Let #(Ci) be the number of examples that are labeled as Ci
      Let #(Ci, Ai=vj) be the number of examples labeled as Ci
          that have attribute Ai set to value vj
      Then  P(Ci) = #(Ci)/N
            P(Ai=vj | Ci) = #(Ci, Ai=vj) / #(Ci)
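A sketch of exactly this counting scheme (examples assumed to be (attribute→value dict, class) pairs; as the later slide suggests, class scores are summed in log space, and the raw ratios here are what the M-estimate would correct):

```python
import math
from collections import Counter, defaultdict

def train_nbc(examples):
    """Estimate P(Ci) and P(Ai=vj | Ci) by counting labeled examples."""
    class_count = Counter(c for _, c in examples)
    attr_count = defaultdict(Counter)                 # attr_count[c][(attr, value)]
    for attribs, c in examples:
        for a, v in attribs.items():
            attr_count[c][(a, v)] += 1
    N = len(examples)
    priors = {c: n / N for c, n in class_count.items()}
    def cond_prob(a, v, c):
        return attr_count[c][(a, v)] / class_count[c]
    return priors, cond_prob

def classify_nbc(priors, cond_prob, attribs):
    """Pick the class maximizing log P(Ci) + sum_j log P(Aj=vj | Ci)."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(cond_prob(a, v, c) or 1e-9)      # crude guard against zero counts
            for a, v in attribs.items())
    return max(priors, key=score)
```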



Example

P(willwait=yes) = 6/12 = .5
P(Patrons="full"|willwait=yes) = 2/6 = 0.333
P(Patrons="some"|willwait=yes) = 4/6 = 0.666
Similarly we can show that P(Patrons="full"|willwait=no) = 0.6666

   P(willwait=yes|Patrons=full) = P(Patrons=full|willwait=yes) * P(willwait=yes) / P(Patrons=full)
                                = k * .333 * .5
   P(willwait=no|Patrons=full)  = k * .666 * .5
Using M-estimates to improve
   probability estimates
  • The simple frequency based estimation of P(Ai=vj|Ck) can be
    inaccurate, especially when the true value is close to zero, and
    the number of training examples is small (so the probability that
    your examples don’t contain rare cases is quite high)
  • Solution: Use M-estimate
          P(Ai=vj | Ci) = [#(Ci, Ai=vj) + m*p ] / [#(Ci) + m]
       – p is the prior probability of Ai taking the value vj
           • If we don't have any background information, assume uniform
             probability (that is, 1/d if Ai can take d values)
       – m is a constant, called the "equivalent sample size"
           • If we believe that our sample set is large enough, we can keep m small.
             Otherwise, keep it large.
           • Essentially we are augmenting the #(Ci) normal samples with m more
             virtual samples drawn according to the prior probability on how Ai takes
             values
                – Popular values p=1/|V| and m=|V| where V is the size of the vocabulary



Also, to avoid overflow errors do addition of logarithms of probabilities
(instead of multiplication of probabilities)
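The correction itself is one line; a sketch (p is the prior on the value, m the equivalent sample size):

```python
def m_estimate(count_ci_ai, count_ci, p, m):
    """P(Ai=vj | Ci) = [#(Ci, Ai=vj) + m*p] / [#(Ci) + m]."""
    return (count_ci_ai + m * p) / (count_ci + m)

# e.g. a value never seen with class Ci, uniform prior over 20 values:
print(m_estimate(0, 6, p=1/20, m=20))   # -> 0.038..., instead of 0
```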
Reinforcement Learning




Based on slides from Bill Smart
    http://www.cse.wustl.edu/~wds/
             What is RL?

 “a way of programming agents by reward and
punishment without needing to specify how the
             task is to be achieved”

                         [Kaelbling, Littman, & Moore, 96]
                  Basic RL Model

    1.   Observe state, st
    2.   Decide on an action, at
    3.   Perform action
    4.   Observe new state, st+1
    5.   Observe reward, rt+1
    6.   Learn from experience
    7.   Repeat

(Figure: the agent observes state S and reward R from the World and sends back action A)


Goal: Find a control policy that will maximize the observed
rewards over the lifetime of the agent
         An Example: Gridworld
Canonical RL domain                         +1

  •   States are grid cells
  •   4 actions: N, S, E, W
  •   Reward for entering top right cell
  •   -0.01 for every other move


Maximizing the sum of rewards ⇒ Shortest path
  • In this instance
The Promise of Learning
            The Promise of RL
Specify what to do, but not how to do it
   • Through the reward function
   • Learning “fills in the details”


Better final solutions
   • Based on actual experiences, not programmer
     assumptions


Less (human) time needed for a good solution
       Learning Value Functions
We still want to learn a value function
   • We’re forced to approximate it iteratively
   • Based on direct experience of the world


Four main algorithms
   •   Certainty equivalence
   •   Temporal Difference (TD) learning
   •   Q-learning
   •   SARSA
         Certainty Equivalence
Collect experience by moving through the world
  • s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4, a4, r5, s5, ...


Use these to estimate the underlying MDP
  • Transition function, T: S×A → S
  • Reward function, R: S×A×S → ℝ


Compute the optimal value function for this MDP
  • And then compute the optimal policy from it
     Temporal Difference (TD)
                                                [Sutton, 88]


TD-learning estimates the value function directly
  • Don’t try to learn the underlying MDP


Keep an estimate of Vπ(s) in a table
  • Update these estimates as we gather more
    experience
  • Estimates depend on the exploration policy, π
  • TD is an on-policy method
TD-Learning Algorithm
1. Initialize Vπ(s) to 0, ∀s
2. Observe state, s
3. Perform action, π(s)
4. Observe new state, s', and reward, r
5. Vπ(s) ← (1-α)Vπ(s) + α(r + γVπ(s'))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
   •   How much attention do we pay to new experiences
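A sketch of that loop (assuming the environment is exposed as step(state, action) → (next_state, reward) and the policy as pi(state) → action; gamma is the usual discount factor):

```python
from collections import defaultdict

def td_learn(step, pi, start_state, episodes=1000, steps=100, alpha=0.1, gamma=0.9):
    """Tabular TD(0) evaluation of policy pi:
    V(s) <- (1-alpha) V(s) + alpha (r + gamma V(s'))."""
    V = defaultdict(float)                      # V(s) initialized to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(steps):
            s_next, r = step(s, pi(s))          # act, observe new state and reward
            V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
            s = s_next
    return V
```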
                       TD-Learning
Vπ(s) is guaranteed to converge to V*(s)
  • After an infinite number of experiences
  • If we decay the learning rate:

       Sum_{t=0..∞} α_t = ∞    and    Sum_{t=0..∞} α_t² < ∞

  • α_t = c / (c + t) will work

In practice, we often don’t need value convergence
  • Policy convergence generally happens sooner
Actor-Critic Methods        [Barto, Sutton, & Anderson, 83]

TD only evaluates a particular policy
  • Does not learn a better policy

We can change the policy as we learn V
  • Policy is the actor
  • Value-function estimate is the critic

(Figure: the actor (policy) sends action a to the World; the critic (value function) sees state s and reward r and critiques the actor through V)

Success is generally dependent on the starting
policy being "good enough"
                 Q-Learning             [Watkins & Dayan, 92]


Q-learning iteratively approximates the state-action
value function, Q
  • Again, we’re not going to estimate the MDP directly
  • Learns the value function and policy simultaneously


Keep an estimate of Q(s, a) in a table
  • Update these estimates as we gather more
    experience
  • Estimates do not depend on exploration policy
  • Q-learning is an off-policy method
Q-Learning Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it
4. Observe next state, s', and reward, r
5. Q(s, a) ← (1-α)Q(s, a) + α(r + γ max_a' Q(s', a'))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
   •   We need to decay this, just like TD
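The same loop with the Q-learning update (a sketch; actions is the finite action set, and the ε-greedy rule from the next slide supplies the exploration):

```python
import random
from collections import defaultdict

def q_learn(step, actions, start_state, episodes=1000, steps=100,
            alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning:
    Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q = defaultdict(float)                        # Q[(s, a)]
    greedy = lambda s: max(actions, key=lambda a: Q[(s, a)])
    for _ in range(episodes):
        s = start_state
        for _ in range(steps):
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r = step(s, a)
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```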
              Picking Actions
We want to pick good actions most of the time, but
also do some exploration
  • Exploring means that we can learn better policies
  • But, we want to balance known good actions with
    exploratory ones
  • This is called the exploration/exploitation problem
Picking Actions
ε-greedy
  • With probability ε, pick a random (exploratory) action
  • Otherwise, pick the best (greedy) action
Boltzmann (Soft-Max)
  • Pick an action based on its Q-value:

       P(a | s) = e^( Q(s,a)/τ )  /  Sum_a' e^( Q(s,a')/τ ) ,   where τ is the "temperature"
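A sketch of both selection rules (Q assumed to be a mapping from (state, action) pairs to values):

```python
import math
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def boltzmann(Q, state, actions, temperature=1.0):
    """P(a|s) proportional to exp(Q(s,a)/temperature)."""
    weights = [math.exp(Q[(state, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```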
                       SARSA
SARSA iteratively approximates the state-action
value function, Q
  • Like Q-learning, SARSA learns the policy and the
    value function simultaneously


Keep an estimate of Q(s, a) in a table
  •   Update these estimates based on experiences
  •   Estimates depend on the exploration policy
  •   SARSA is an on-policy method
  •   Policy is derived from current value estimates
SARSA Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it (just like Q-learning)
4. Observe next state, s', and reward, r
5. Q(s, a) ← (1-α)Q(s, a) + α(r + γ Q(s', π(s')))
6. Go to 2

0 ≤ α ≤ 1 is the learning rate
   •   We need to decay this, just like TD
       On-Policy vs. Off Policy
On-policy algorithms
  • Final policy is influenced by the exploration policy
  • Generally, the exploration policy needs to be “close”
    to the final policy
  • Can get stuck in local maxima


Off-policy algorithms
  • Final policy is independent of exploration policy
  • Can use arbitrary exploration policies
  • Will not get stuck in local maxima
     Convergence Guarantees
The convergence guarantees for RL are “in the
limit”
   • The word “infinite” crops up several times


Don’t let this put you off
   • Value convergence is different than policy
     convergence
   • We’re more interested in policy convergence
   • If one action is really better than the others, policy
     convergence will happen relatively quickly
                     Rewards
Rewards measure how well the policy is doing
  • Often correspond to events in the world
     • Current load on a machine
     • Reaching the coffee machine
     • Program crashing
  • Everything else gets a 0 reward


Things work better if the rewards are incremental
  • For example, distance to goal at each step
  • These reward functions are often hard to design
The Markov Property
RL needs a set of states that are Markov
  • Everything you need to know to make a decision is
    included in the state
  • Not allowed to consult the past

Rule-of-thumb
  • If you can calculate the reward
    function from the state without
    any additional information,
    you're OK

(Figure: a gridworld from S to G with a key K; the state must distinguish "holding key" from "not holding key")
          But, What’s the Catch?
RL will solve all of your problems, but
   •   We need lots of experience to train from
   •   Taking random actions can be dangerous
   •   It can take a long time to learn
   •   Not all problems fit into the MDP framework
     Learning Policies Directly
An alternative approach to RL is to reward whole
policies, rather than individual actions
  • Run whole policy, then receive a single reward
  • Reward measures success of the whole policy


If there are a small number of policies, we can
exhaustively try them all
  • However, this is not possible in most interesting
    problems
Policy Gradient Methods
Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ... , θn}
   • Running the policy with a particular θ results in a reward, rθ
   • Estimate the reward gradient, ∂R/∂θi, for each θi
   • Update:  θi ← θi + α ∂R/∂θi
        (this α is another learning rate)
      Policy Gradient Methods
This results in hill-climbing in policy space
   • So, it’s subject to all the problems of hill-climbing
   • But, we can also use tricks from search, like random
     restarts and momentum terms


This is a good approach if you have a
parameterized policy
   • Typically faster than value-based methods
   • “Safe” exploration, if you have a good policy
   • Learns locally-best parameters for that policy
An Example: Learning to Walk
                                             [Kohl & Stone, 04]

RoboCup legged league
  • Walking quickly is a big advantage


Robots have a parameterized gait controller
  • 11 parameters
  • Controls step length, height, etc.


Robots walk across soccer pitch and are timed
  • Reward is a function of the time taken
An Example: Learning to Walk
Basic idea
  1. Pick an initial θ = {θ1, θ2, ... , θ11}
  2. Generate N testing parameter settings by perturbing θ
        θj = {θ1 + δ1, θ2 + δ2, ... , θ11 + δ11},   δi ∈ {-ε, 0, ε}
  3. Test each setting, and observe rewards
        θj → rj
  4. For each θi, compute the average reward of the settings where θi was
     perturbed by +ε, 0, and -ε (call these θi+, θi0, θi-), and set
        θ'i ← θi + d   if θi+ gave the largest average reward
        θ'i ← θi        if θi0 gave the largest average reward
        θ'i ← θi - d   if θi- gave the largest average reward
  5. Set θ ← θ', and go to 2
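A sketch of that search (assuming an evaluate(theta) function that runs the gait and returns the timed reward; epsilon is the perturbation size and step the update size):

```python
import random

def policy_search(evaluate, theta, epsilon=0.05, step=0.1, n_samples=15, iterations=50):
    """Perturbation-based policy search in parameter space (sketch of the
    Kohl & Stone style gait-learning loop above)."""
    for _ in range(iterations):
        # 2-3. generate N perturbed settings and score each
        deltas = [[random.choice((-epsilon, 0.0, epsilon)) for _ in theta]
                  for _ in range(n_samples)]
        rewards = [evaluate([t + d for t, d in zip(theta, row)]) for row in deltas]
        # 4. for each parameter, move toward whichever perturbation did best on average
        new_theta = []
        for i, t in enumerate(theta):
            avg = {}
            for d in (-epsilon, 0.0, epsilon):
                rs = [r for row, r in zip(deltas, rewards) if row[i] == d]
                avg[d] = sum(rs) / len(rs) if rs else float('-inf')
            best = max(avg, key=avg.get)
            new_theta.append(t if best == 0.0 else t + (step if best > 0 else -step))
        theta = new_theta                       # 5. theta <- theta', repeat
    return theta
```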
An Example: Learning to Walk




     Initial                    Final


               Video: Nate Kohl & Peter Stone, UT Austin
Value Function or Policy Gradient?
When should I use policy gradient?
  • When there’s a parameterized policy
  • When there’s a high-dimensional state space
  • When we expect the gradient to be smooth


When should I use a value-based method?
  • When there is no parameterized policy
  • When we have no idea how to solve the problem
            Summary for Part I
Background
  • MDPs, and how to solve them
  • Solving MDPs with dynamic programming
  • How RL is different from DP
Algorithms
  •   Certainty equivalence
  •   TD
  •   Q-learning
  •   SARSA
  •   Policy gradient

								