ITCS 6150/8150 Fall 2011 Jing Xiao
Forms of Learning (Chapter 18.1)
Supervised learning:
Learn a function from examples of its inputs and
outputs.
For fully observable environments, it will always
be the case that an agent can observe the effects of
its actions and can use supervised learning methods
to learn to predict them.
Unsupervised learning:
Learn patterns in the input when no specific output
values are supplied.
E.g., a taxi agent might gradually develop a
concept of “good traffic days” and “bad traffic
days” without ever being given labeled examples
of each.
Reinforcement learning:
Rather than being told what to do by a teacher, a
reinforcement learning agent must learn from
reinforcement (feedback).
Statistical Learning Methods (Chapter 20)
Key concepts:
Data – evidence
Hypotheses – probabilistic theories of how the
domain works
Example: A bag of candy can be one of five
kinds:
h1: 100% cherry
h2: 75% cherry and 25% lime
h3: 50% cherry and 50% lime
h4: 25% cherry and 75% lime
h5: 100% lime
The random variable H (for hypothesis), which is not
directly observable, denotes the type of the bag with
possible values h1 through h5.
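Suppose P(H) = ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ (the prior over h1, …, h5 used in the
textbook's version of this example).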
As the pieces of candy are opened, data are revealed:
D1, D2, …, DN, where each Di is a random Boolean variable
with two possible values, cherry and lime. Let D
represent (a vector of) all the data, with observed
value d.
Problem: how to learn to predict the flavor of the
next piece of candy.
Bayesian learning: calculates the probability of each
hypothesis, given the data, and makes predictions on
that basis.
The predictions are made by using all the hypotheses,
weighted by their probabilities – learning is reduced
to probabilistic inference.
Suppose the observations are independent and
identically distributed (i.i.d.); then
P(d|hi) = ∏j P(dj|hi)
Note that P(dj|hi) is known from the given info.
The probability of each hypothesis, given the data, is
thus:
P(hi|d) = α P(d|hi)P(hi), where α is a normalization constant
Now to make a prediction about the flavor of a new
piece of candy X (from the bag), we compute:
P(X|d) = ∑iP(X|d, hi)P(hi|d) = ∑i P(X|hi)P(hi|d)
That is, the prediction is the weighted average over
the predictions of individual hypotheses.
The above formula is obtained through conditioning
-- see Lecture 17, formula (A), and conditional
independence: P(X|d, hi) = P(X|hi) because of the
i.i.d. assumption.
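A minimal Python sketch of the posterior and prediction formulas for the candy example. The prior values are an assumption borrowed from the textbook's version of the example, and the function names are illustrative only.

```python
# Sketch of Bayesian learning for the candy-bag example.
# ASSUMPTION: PRIOR is illustrative (textbook values); names are hypothetical.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]         # P(h1) .. P(h5), assumed
P_CHERRY = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(cherry | hi), from the bag definitions


def likelihood(data, p_cherry):
    """P(d | hi) for i.i.d. data: the product of the per-candy probabilities."""
    result = 1.0
    for flavor in data:
        result *= p_cherry if flavor == "cherry" else 1.0 - p_cherry
    return result


def posterior(data, prior=PRIOR):
    """P(hi | d) = alpha * P(d | hi) * P(hi); alpha makes the values sum to 1."""
    unnormalized = [likelihood(data, pc) * p for pc, p in zip(P_CHERRY, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predict_cherry(data, prior=PRIOR):
    """P(X = cherry | d) = sum_i P(cherry | hi) * P(hi | d)."""
    return sum(pc * ph for pc, ph in zip(P_CHERRY, posterior(data, prior)))


observed = ["lime", "lime"]
print(posterior(observed))
print(1.0 - predict_cherry(observed))   # probability that the next candy is lime
```

With no data, posterior([]) returns the prior itself, so the first prediction is just the prior-weighted average of the hypotheses.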
Suppose the bag is really of type h5. Fig. 20 of page
804 shows how the prediction based on h5 increases
its dominance as more data are observed. That is, the
more data are observed, the more accurate the
prediction of the flavor of the next piece becomes.
(Initially, the prediction power of each hypothesis is
based on its prior probability.)
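The growing dominance of h5 can be illustrated with a small sketch that updates the belief one candy at a time (for i.i.d. data this gives the same result as processing all the data at once). The prior is again an assumed, illustrative one, and the all-lime sequence is a hypothetical data set.

```python
# Sketch: sequential Bayesian updating when every unwrapped candy is lime.
# ASSUMPTION: the prior and the 10-lime sequence are illustrative only.
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi) for h1 .. h5


def update(belief, flavor):
    """One Bayes step: multiply by P(flavor | hi), then renormalize."""
    unnormalized = [
        b * (pl if flavor == "lime" else 1.0 - pl)
        for b, pl in zip(belief, P_LIME)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


belief = [0.1, 0.2, 0.4, 0.2, 0.1]        # assumed prior P(h1) .. P(h5)
history = [belief]
for _ in range(10):                        # unwrap 10 lime candies
    belief = update(belief, "lime")
    history.append(belief)

for n, b in enumerate(history):
    p_next_lime = sum(pl * ph for pl, ph in zip(P_LIME, b))
    print(n, [round(x, 3) for x in b], round(p_next_lime, 3))
```

Each printed row shows the number of candies seen, the posterior over h1 to h5, and the predicted probability that the next candy is lime.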
Therefore, the true hypothesis eventually dominates the
Bayesian prediction!
Bayesian prediction is optimal but at a price: the
hypothesis space is usually very large or infinite.
Some approximate or simplified methods:
Maximum a posteriori (MAP) learning: make
predictions based on a single most probable
hypothesis, i.e., an hi that maximizes P(hi|d), called
hMAP.
Predictions made according to an MAP hypothesis are
approximately Bayesian to the extent that
P(X|d) ≈ P(X|hMAP)
Finding MAP hypotheses is often much easier than
Bayesian learning, because it requires solving an
optimization problem instead of a large summation
(or integration) problem.
Maximum-likelihood (ML) learning: assume a
uniform prior probability distribution P(hi) so that
MAP learning reduces to choosing an hi that
maximizes P(d|hi). Such an hi is denoted as hML.
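The difference between hMAP and hML can be seen in a short sketch on the candy example. The prior is an assumed, illustrative one, and the helper names are hypothetical.

```python
# Sketch: choosing hMAP and hML for the candy example.
# ASSUMPTION: the prior values are illustrative only.
prior = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}


def likelihood(h, n_limes, n_cherries):
    """P(d | h) for i.i.d. data containing the given flavor counts."""
    return p_lime[h] ** n_limes * (1.0 - p_lime[h]) ** n_cherries


def h_ml(n_limes, n_cherries):
    """Hypothesis maximizing P(d | hi); the prior is ignored."""
    return max(prior, key=lambda h: likelihood(h, n_limes, n_cherries))


def h_map(n_limes, n_cherries):
    """Hypothesis maximizing P(d | hi) * P(hi), i.e. the posterior up to alpha."""
    return max(prior, key=lambda h: likelihood(h, n_limes, n_cherries) * prior[h])


for n in range(1, 4):
    print(n, "limes:", "hML =", h_ml(n, 0), " hMAP =", h_map(n, 0))
```

Under this assumed prior, hML jumps to h5 after a single lime, while hMAP needs more evidence before it moves away from the hypotheses the prior favors.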
ML learning is very common in statistics. It is a
reasonable approach when there is no reason to prefer
one hypothesis over another a priori. It provides a
good approximation to Bayesian and MAP learning
when the data set is large.
Parameter learning with complete data: involves
finding the numerical parameters for a probability
model whose structure is fixed. For example, learn
the conditional probabilities in a Bayesian network.
Maximum-likelihood parameter learning:
Discrete models
Again consider the example of a candy bag with lime
and cherry candies of unknown proportions.
Unknown parameter θ: the proportion of cherry
candies. The hypothesis is hθ. A uniform prior
distribution over hθ can be assumed (i.e., all proportions
are equally likely a priori), so maximum-likelihood
learning is appropriate.
Now the situation can be modeled by a simple
Bayesian network, and we are interested in learning
the probability:
P(Flavor = cherry) = θ, the proportion of cherry candies
Flavor: the flavor of a randomly chosen
candy from the bag, with value of either
cherry or lime.
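The likelihood of c cherries (or N-c limes) after
unwrapping N candies is:
P(d|hθ) = ∏j P(dj|hθ) = θ^c (1 − θ)^(N−c)
Under ML learning, we need to find the θ that
maximizes P(d|hθ). This can be obtained by
maximizing the log likelihood (which changes the
product to summation):
L(d|hθ) = log P(d|hθ) = ∑j log P(dj|hθ) = c log θ + (N − c) log(1 − θ)
To find the θ that maximizes L(d|hθ), we do:
dL(d|hθ)/dθ = c/θ − (N − c)/(1 − θ) = 0  ⟹  θ = c/N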
This shows that the maximum-likelihood hypothesis hML
asserts that the actual proportion of cherries in
the bag is equal to the observed proportion in the
candies unwrapped so far!
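A quick numerical check of this result, assuming the likelihood θ^c (1 − θ)^(N−c) for c cherries among N i.i.d. candies. The counts and the grid search are hypothetical illustrations, not part of the derivation.

```python
import math


def log_likelihood(theta, c, n):
    """Log likelihood c*log(theta) + (n - c)*log(1 - theta), for 0 < theta < 1."""
    return c * math.log(theta) + (n - c) * math.log(1.0 - theta)


def ml_estimate(c, n):
    """Closed-form maximizer: the observed proportion of cherries."""
    return c / n


c, n = 7, 10                                # hypothetical counts
theta_hat = ml_estimate(c, n)

# Brute-force comparison over a grid of candidate theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, c, n))

print(theta_hat, best_on_grid)
```

The closed form breaks down gracefully only for 0 < c < N; when c = 0 or c = N the maximizer sits on the boundary, where the log likelihood as written above is undefined.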
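To summarize, one standard method for
maximum-likelihood parameter learning:
1. Write down an expression for the likelihood of the
data as a function of the parameter(s).
2. Write down the derivative of the log likelihood
with respect to each parameter.
3. Find the parameter values at which the derivatives
are zero.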
Naïve Bayes models: the most common Bayesian
network model used in machine learning.
The “class” variable C is the root and the “attribute”
variables Xi (i.e., “wrapper” below) are the leaves.
…
The model is “naïve” because it assumes that the
attributes are conditionally independent of each other,
given the class.
Once the model parameters (the θ, θ1, θ2 above) are
learned, classification can be done for each set of
new attribute data with unknown class (i.e., the
Bayesian network can now be used):
P(C|x1, …, xn) = α P(C) ∏i P(xi|C)
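A minimal sketch of naive Bayes classification with parameters that have already been learned. It assumes the class is the candy flavor and the single attribute is the wrapper color, with θ = P(cherry), θ1 = P(red wrapper | cherry) and θ2 = P(red wrapper | lime). This reading of the parameters and all numeric values are assumptions made for illustration.

```python
# Sketch of naive Bayes classification with already-learned parameters.
# ASSUMPTION: parameter meanings and values below are hypothetical.
def classify(class_prior, cond_prob, attributes):
    """Return P(C | x1..xn), computed as alpha * P(C) * prod_i P(xi | C)."""
    scores = {}
    for c, p_c in class_prior.items():
        score = p_c
        for name, value in attributes.items():
            score *= cond_prob[c][name][value]   # attributes independent given C
        scores[c] = score
    total = sum(scores.values())                 # 1 / alpha
    return {c: s / total for c, s in scores.items()}


theta, theta1, theta2 = 0.5, 0.8, 0.3            # hypothetical learned values
class_prior = {"cherry": theta, "lime": 1.0 - theta}
cond_prob = {
    "cherry": {"wrapper": {"red": theta1, "green": 1.0 - theta1}},
    "lime": {"wrapper": {"red": theta2, "green": 1.0 - theta2}},
}

posterior = classify(class_prior, cond_prob, {"wrapper": "red"})
print(posterior)
print(max(posterior, key=posterior.get))         # most probable class
```

Adding more attributes only adds entries to cond_prob and to the attributes dictionary; the classification step itself does not change.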