
learning

ITCS 6150/8150 Fall 2011 Jing Xiao







Forms of Learning (Chapter 18.1)



• Supervised learning:

Learn a function from examples of its inputs and

outputs.



For fully observable environments, it will always

be the case that an agent can observe the effects of

its actions and can use supervised learning methods

to learn to predict them.



• Unsupervised learning:

Learn patterns in the input when no specific output

values are supplied.



E.g., a taxi agent might gradually develop a

concept of “good traffic days” and “bad traffic

days” without ever being given labeled examples

of each.



• Reinforcement learning:

Rather than being told what to do by a teacher, a
reinforcement learning agent must learn from
reinforcement, i.e., feedback on its actions.
















Statistical Learning Methods (Chapter 20)



Key concepts:

Data – evidence

Hypotheses – probabilistic theories of how the

domain works



Example: A bag of candy can be one of five kinds:

h1: 100% cherry

h2: 75% cherry and 25% lime

h3: 50% cherry and 50% lime

h4: 25% cherry and 75% lime

h5: 100% lime



The random variable H (for hypothesis), which is not

directly observable, denotes the type of the bag with

possible values h1 through h5.

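Suppose P(H) = ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ for h1 through h5
(the prior used in the textbook's version of this example).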



As the pieces of candy are opened, data are revealed:

D1, D2, …, DN, where each Di is a random variable

with two possible values, cherry and lime. Let D

represent (a vector of) all the data, with observed

value d.












Problem: how to learn to predict the flavor of the

next piece of candy.



Bayesian learning: calculates the probability of each

hypothesis, given the data, and makes predictions on

that basis.



The predictions are made by using all the hypotheses,

weighted by their probabilities – learning is reduced

to probabilistic inference.



Suppose data observations are independently and

identically distributed (i.i.d.), then



P(d|hi) = ∏j P(dj|hi)



Note that P(dj|hi) is known from the given info.



The probability of each hypothesis, given the data, is

thus:



P(hi|d) = α P(d|hi) P(hi), where α is a normalizing constant.



Now to make a prediction about the flavor of a new

piece of candy X (from the bag), we compute:



P(X|d) = ∑iP(X|d, hi)P(hi|d) = ∑i P(X|hi)P(hi|d)






That is, the prediction is the weighted average over

the predictions of individual hypotheses.



The above formula is obtained through conditioning

-- see Lecture 17, formula (A), and conditional

independence: P(X|d, hi) = P(X|hi) because of the

i.i.d. assumption.



Suppose the bag is really of type h5. Fig. 20 on page
804 shows how h5 increasingly dominates the prediction
as more data are observed. That is, the more data are
observed, the more accurate the prediction of the
flavor of the next piece becomes. (Initially, the weight
of each hypothesis in the prediction is its prior
probability.)
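
The update and prediction formulas above can be tried out numerically. The following is a minimal sketch, not code from the lecture: the prior ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ is an assumption taken from the textbook's version of the candy example, and the function names are illustrative.

```python
# Lime fraction asserted by each hypothesis h1..h5 (from the bag descriptions).
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]
# Assumed prior over h1..h5 (textbook values; not stated in these notes).
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def posterior(observations, prior=PRIOR, p_lime=P_LIME):
    """Return P(h_i | d) for a sequence of 'lime'/'cherry' observations."""
    weights = []
    for p_h, p_l in zip(prior, p_lime):
        likelihood = 1.0
        for flavor in observations:  # i.i.d.: multiply per-candy likelihoods
            likelihood *= p_l if flavor == "lime" else 1.0 - p_l
        weights.append(likelihood * p_h)  # P(d | h_i) P(h_i)
    alpha = 1.0 / sum(weights)  # normalizing constant
    return [alpha * w for w in weights]


def predict_lime(observations, prior=PRIOR, p_lime=P_LIME):
    """Return P(next candy is lime | d) as a posterior-weighted average."""
    post = posterior(observations, prior, p_lime)
    return sum(p_l * p for p_l, p in zip(p_lime, post))


if __name__ == "__main__":
    # Unwrap limes one at a time and watch h5 take over the posterior.
    for n in range(11):
        d = ["lime"] * n
        print(n, [round(p, 3) for p in posterior(d)], round(predict_lime(d), 3))
```

With no data the predicted probability of lime equals the prior-weighted average, 0.5; each additional lime shifts posterior mass toward h5 and pushes the prediction toward 1.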














Therefore, the true hypothesis eventually dominates the

Bayesian prediction!



Bayesian prediction is optimal but at a price: the
hypothesis space is usually very large or infinite, so
summing (or integrating) over all hypotheses can be
intractable.



Some approximate or simplified methods:



Maximum a posteriori (MAP) learning: make

predictions based on a single most probable

hypothesis, i.e., an hi that maximizes P(hi|d), called

hMAP.



Predictions made according to an MAP hypothesis are

approximately Bayesian to the extent that

P(X|d) ≈ P(X|hMAP)



Finding MAP hypotheses is often much easier than

Bayesian learning, because it requires solving an

optimization problem instead of a large summation

(or integration) problem.
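
As a rough illustration of MAP learning on the candy example, the sketch below picks the hypothesis with the largest P(d|hi)P(hi) and predicts from it alone. The prior values are assumed (textbook values), and the function names are hypothetical.

```python
# Lime fraction asserted by each hypothesis h1..h5.
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]
# Assumed prior over h1..h5 (textbook values; not stated in these notes).
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]


def map_hypothesis(n_lime, n_cherry, prior=PRIOR, p_lime=P_LIME):
    """Return the index i (0-based) that maximizes P(d | h_i) P(h_i)."""
    scores = [
        p_h * (p_l ** n_lime) * ((1.0 - p_l) ** n_cherry)
        for p_h, p_l in zip(prior, p_lime)
    ]
    # Normalization is not needed: it does not change the argmax.
    return max(range(len(scores)), key=scores.__getitem__)


def map_predict_lime(n_lime, n_cherry, prior=PRIOR, p_lime=P_LIME):
    """Predict P(next candy is lime) using only the MAP hypothesis."""
    return p_lime[map_hypothesis(n_lime, n_cherry, prior, p_lime)]


if __name__ == "__main__":
    for n in range(5):
        i = map_hypothesis(n, 0)
        print(f"{n} limes: hMAP = h{i + 1}, P(lime) = {map_predict_lime(n, 0)}")
```

Under these assumed priors, hMAP moves from h3 to h4 after two limes and to h5 after three, at which point the MAP prediction for lime is already 1.0, while the full Bayesian prediction is still below 1. The two agree more closely as data accumulate.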



Maximum-likelihood (ML) learning: assume a

uniform prior probability distribution P(hi) so that

MAP learning reduces to choosing an hi that

maximizes P(d|hi). Such an hi is denoted as hML.








ML learning is very common in statistics. It is a

reasonable approach when there is no reason to prefer

one hypothesis over another a priori. It provides a

good approximation to Bayesian and MAP learning

when the data set is large.
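
For comparison, a minimal sketch of ML learning over the same five candy hypotheses: the prior is ignored and the hypothesis with the largest P(d|hi) is chosen. The function name is illustrative, and ties are broken in favor of the lowest index.

```python
# Lime fraction asserted by each hypothesis h1..h5.
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]


def ml_hypothesis(n_lime, n_cherry, p_lime=P_LIME):
    """Return the index i (0-based) that maximizes the likelihood P(d | h_i)."""
    likelihoods = [
        (p_l ** n_lime) * ((1.0 - p_l) ** n_cherry) for p_l in p_lime
    ]
    # max() returns the first maximal index, so ties go to the lowest i.
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)


if __name__ == "__main__":
    print("1 lime:", f"h{ml_hypothesis(1, 0) + 1}")
    print("3 limes, 1 cherry:", f"h{ml_hypothesis(3, 1) + 1}")
```

A single lime is enough for ML learning to select h5, which shows how strongly it can commit on small data sets; with three limes and one cherry it selects h4, the hypothesis whose lime fraction matches the observed 75%.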



Parameter learning with complete data: involves

finding the numerical parameters for a probability

model whose structure is fixed. For example, learn

the conditional probabilities in a Bayesian network.



Maximum-likelihood parameter learning:

Discrete models



Again consider the example of a candy bag with lime

and cherry candies of unknown proportions.

Unknown parameter θ: the proportion of cherry
candies. The hypothesis is hθ. A uniform prior
distribution over hθ can be assumed (i.e., all proportions
are equally likely a priori), so a maximum-likelihood
approach is reasonable.
Now the situation can be modeled by a simple
Bayesian network, and we are interested in learning
the probability P(Flavor = cherry) = θ:



Flavor: the flavor of a randomly chosen

candy from the bag, with value of either

cherry or lime.








The likelihood of c cherries (or N-c limes) after
unwrapping N candies is:

P(d|hθ) = ∏j P(dj|hθ) = θ^c (1 − θ)^(N−c)







Under ML learning, we need to find the θ that
maximizes P(d|hθ). This can be obtained by
maximizing the log likelihood (which changes the
product to summation):

L(d|hθ) = log P(d|hθ) = ∑j log P(dj|hθ) = c log θ + (N − c) log(1 − θ)





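To find the θ that maximizes L(d|hθ), we set its
derivative with respect to θ to zero:

dL(d|hθ)/dθ = c/θ − (N − c)/(1 − θ) = 0, which gives θ = c/N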







This shows that the maximum-likelihood hypothesis hML
asserts that the actual proportion of cherries in
the bag is equal to the observed proportion in the
candies unwrapped so far!
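
The closed-form result θ = c/N can be checked numerically. The sketch below is illustrative only: it evaluates the log likelihood c log θ + (N − c) log(1 − θ) on a grid of θ values and compares the best grid point with c/N. The function names and the grid resolution are assumptions.

```python
import math


def log_likelihood(theta, c, n):
    """Log likelihood of c cherries among n candies, for 0 < theta < 1."""
    return c * math.log(theta) + (n - c) * math.log(1.0 - theta)


def ml_theta(c, n):
    """Closed-form maximum-likelihood estimate: the observed proportion."""
    return c / n


def grid_argmax(c, n, steps=1000):
    """Brute-force check: the grid point in (0, 1) with the largest log likelihood."""
    grid = [k / steps for k in range(1, steps)]  # excludes 0 and 1
    return max(grid, key=lambda t: log_likelihood(t, c, n))


if __name__ == "__main__":
    c, n = 3, 10
    print("closed form:", ml_theta(c, n))
    print("grid search:", grid_argmax(c, n))
```

With 3 cherries among 10 candies, both approaches give θ = 0.3.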



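To summarize, one standard method for
maximum-likelihood parameter learning:

1. Write down an expression for the likelihood of the
data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with
respect to each parameter.
3. Find the parameter values for which the derivatives
are zero.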














Naïve Bayes models: the most common Bayesian

network model used in machine learning.



The “class” variable C is the root and the “attribute”

variables Xi (i.e., “wrapper” below) are the leaves.



















The model is “naïve” because it assumes that the

attributes are conditionally independent of each other,

given the class.



Once the model parameters (the θ, θ1, θ2 above) are
learned, classification can be done for each set of
new attribute data with unknown class (i.e., the
Bayesian network can now be used):

P(C | x1, …, xn) = α P(C) ∏i P(xi | C)
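
A minimal sketch of this classification step is given below. It assumes the parameters have already been learned and are supplied as plain dictionaries; the numeric values in the usage example (a class prior of 0.5, and probabilities 0.8 and 0.3 of a red wrapper for cherry and lime) are made up for illustration and do not come from the notes.

```python
def classify(prior, cond, attributes):
    """Return P(C | x_1..x_n) for a naive Bayes model.

    prior:      {class_value: P(C = class_value)}
    cond:       {class_value: [ {x_value: P(X_i = x_value | C)} for each X_i ]}
    attributes: observed values [x_1, ..., x_n]
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        # Naive assumption: attributes are conditionally independent given C,
        # so the joint likelihood is a product of per-attribute terms.
        for table, value in zip(cond[c], attributes):
            score *= table[value]
        scores[c] = score
    total = sum(scores.values())  # 1 / alpha
    return {c: s / total for c, s in scores.items()}


if __name__ == "__main__":
    # Hypothetical learned parameters for a single "wrapper" attribute.
    prior = {"cherry": 0.5, "lime": 0.5}
    cond = {
        "cherry": [{"red": 0.8, "green": 0.2}],
        "lime": [{"red": 0.3, "green": 0.7}],
    }
    print(classify(prior, cond, ["red"]))
```

The predicted class is the one with the highest posterior; with these made-up numbers a red wrapper favors cherry.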








