# MDP Reinforcement Learning

Markov Decision Process

[Diagram: the Charity MDP, with a state asking “Should you give money to charity?”, a state asking “Would you contribute?”, and a final state where money ($) is collected.]
## Charity MDP

- State space: 3 states
- Actions: “Should you give money to charity?”, “Would you contribute?”
- Observations: knowledge of the current state
- Rewards: in the final state, a positive reward proportional to the amount of money gathered

So how can we raise the most money (maximize the reward)? In other words, what is the best policy?

- Policy: the optimal action for each state
## Lecture Outline

1. Computing the Value Function
2. Finding the Optimal Policy
3. Computing the Value Function in an Online Environment
## Useful definitions

Define:

- π: a policy
- π(j): the action to take in state j
- R(j): the reward received in state j
- f(j, a): the next state, reached by starting in state j and performing action a
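
To make these definitions concrete, here is a minimal Python sketch of the charity MDP. The state names, the transition structure, and the reward amount are assumptions added for illustration; the slides only specify three states, the two questions as actions, and a reward in the final state.

```python
STATES = ["start", "asked_contribute", "done"]
ACTIONS = ["should_give", "would_contribute"]

def R(j):
    """Reward from a state: only the final state pays off (the amount is assumed)."""
    return 1.0 if j == "done" else 0.0

def f(j, a):
    """Next state from state j after action a (assumed transition structure)."""
    if j == "start":
        # Asking “Should you give money to charity?” leads to the follow-up question;
        # asking “Would you contribute?” goes straight to the final state.
        return "asked_contribute" if a == "should_give" else "done"
    return "done"

def pi(j):
    """An example policy π: ask both questions in order."""
    return "should_give" if j == "start" else "would_contribute"
```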
## Computing the Value Function

When the reward is known, we can compute the value function for a particular policy.

V(j), the value function, is the expected reward for being in state j and following a certain policy π.
Calculating V(j)
1.    Set V0 (j) = 0, for all j
2.    For i = 1 to Max_i
 Vi (j) = R(j) +  V(i-1) (f(j, (j)))

= the discount rate, measures how much future rewards
can propagate to previous states

Above formula depends on the rewards being known
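
A short Python sketch of this iteration, reusing the R, f, and pi functions defined above (which were themselves illustrative assumptions):

```python
def compute_value(policy, R, f, states, gamma=0.5, max_i=3):
    """Iterate V_i(j) = R(j) + gamma * V_{i-1}(f(j, policy(j)))."""
    V = {j: 0.0 for j in states}            # step 1: V_0(j) = 0 for all j
    for _ in range(max_i):                  # step 2: repeat the update Max_i times
        V = {j: R(j) + gamma * V[f(j, policy(j))] for j in states}
    return V
```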
## Value Function for the Charity MDP

Fix γ at 0.5 and consider two policies: one that asks both questions, and one that cuts to the chase.

What is V_3 if:

1. The reward at the final state is constant (everyone gives the same amount of money)?
2. The reward is 10 times higher if you ask whether one should give money to charity?
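
One way to explore the two cases is to run the iteration above under different reward assumptions. The specific reward amounts and the encoding of the "cuts to the chase" policy below are illustrative guesses, not values given in the slides:

```python
def pi_direct(j):
    """Assumed 'cuts to the chase' policy: always ask “Would you contribute?”."""
    return "would_contribute"

# Case 2 (assumed): a 10x larger final reward when the “should you give money
# to charity” question is asked first.
R_boost = lambda j: 10.0 if j == "done" else 0.0

print(compute_value(pi, R, f, STATES, max_i=3))         # case 1, asks both questions
print(compute_value(pi_direct, R, f, STATES, max_i=3))  # case 1, cuts to the chase
print(compute_value(pi, R_boost, f, STATES, max_i=3))   # case 2, asks both questions
```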
Given the value function, how can we find the policy that maximizes the rewards?
## Policy Iteration

1. Set π_0 to be an arbitrary policy
2. Set i to 0
3. Compute V_{π_i}(j) for all states j
4. Compute π_{i+1}(j) = argmax_a V_{π_i}(f(j, a))
5. If π_{i+1} = π_i, stop; otherwise increment i and go back to step 3

What would this produce for the charity MDP in the two cases above?
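
A minimal Python sketch of policy iteration on the same assumed charity MDP, reusing the compute_value routine sketched earlier:

```python
def policy_iteration(R, f, states, actions, gamma=0.5, max_i=10):
    """Alternate between evaluating the current policy and greedily improving it."""
    policy = {j: actions[0] for j in states}               # step 1: arbitrary initial policy
    while True:
        V = compute_value(lambda j: policy[j], R, f, states, gamma, max_i)   # step 3
        # Step 4: in each state, pick the action whose successor state has the best value.
        new_policy = {j: max(actions, key=lambda a: V[f(j, a)]) for j in states}
        if new_policy == policy:                           # step 5: stop once the policy is stable
            return policy, V
        policy = new_policy
```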
## MDP Learning

So, when the rewards are known, we can calculate the optimal policy using policy iteration.

But what happens in the case where we don't know the rewards?
## Deterministic vs. Stochastic Update

Deterministic:
V_i(j) = R(j) + γ V_{i-1}(f(j, π(j)))

Stochastic:
V(n) = (1 - α) V(n) + α [r + γ V(n′)]

The difference is that the stochastic version averages over all visits to the state.
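
As a rough Python rendering of the two update rules (the α value is an assumed placeholder; γ is the 0.5 from the charity example):

```python
GAMMA = 0.5   # discount rate
ALPHA = 0.1   # learning rate for the stochastic update (assumed value)

def deterministic_update(V, j, R, f, pi):
    """Full-model update: uses the known reward and transition functions."""
    return R(j) + GAMMA * V[f(j, pi(j))]

def stochastic_update(V, n, r, n_next):
    """Sample-based update: blends the old estimate with one observed transition."""
    return (1 - ALPHA) * V[n] + ALPHA * (r + GAMMA * V[n_next])
```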
## MDP extensions

Probabilistic state transitions:

[Diagram: from the first state, the question “Would you like to contribute?” leads to one next state with probability 0.8 and to another with probability 0.2.]

How should you calculate the value function for the first state now?
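
One natural way to handle this (the standard expectation over next states; the 0.8/0.2 numbers are taken from the diagram above) is to weight each possible successor state by its transition probability:

V(j) = R(j) + γ [0.8 · V(j₁) + 0.2 · V(j₂)]

where j₁ and j₂ are the two states the transition can lead to.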
## Probabilistic Transitions

The online computation strategy works the same way even when the state transitions are probabilistic.

It also works in the case where you don't know what the transitions are.
Online V(j) Computation
1.    For each j initialize V(j) = 0
2.    Set n = initial state
3.    Set r = reward in state n
4.    Let n’ = f(n, (n))
5.    V(n) = (1 - ) V(n) + [r + V(n’)]
6.    n = n’, and back to step 3
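
A small Python sketch of this loop for the assumed charity MDP. The environment step function, the restart after the final state, and the step count are illustrative choices, not details from the slides:

```python
import random

def step(n, a):
    """Assumed stochastic environment: the contribution question succeeds with
    probability 0.8, otherwise we fall back to the start state."""
    if n == "start" and a == "should_give":
        return "asked_contribute"
    return "done" if random.random() < 0.8 else "start"

def online_value_estimation(pi, R, step, states, alpha=0.1, gamma=0.5, steps=1000):
    """Estimate V under policy pi from sampled transitions, with no transition model."""
    V = {j: 0.0 for j in states}          # step 1: initialize V(j) = 0
    n = "start"                           # step 2: start in the initial state
    for _ in range(steps):
        r = R(n)                          # step 3: observe the reward in state n
        n_next = step(n, pi(n))           # step 4: sample the next state
        V[n] = (1 - alpha) * V[n] + alpha * (r + gamma * V[n_next])   # step 5
        n = n_next if n_next != "done" else "start"   # step 6 (restart after the final state)
    return V

print(online_value_estimation(pi, R, step, STATES))
```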
## 1-step Q-learning

1. Initialize Q(n, a) arbitrarily
2. Select π as the policy
3. n = the initial state, r = the reward, a = π(n)
4. Q(n, a) = (1 - α) Q(n, a) + α [r + γ max_{a′} Q(n′, a′)]
5. Set n = n′ and go back to step 3
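
A compact Python sketch of the 1-step Q-learning update on the assumed charity MDP, reusing the illustrative step function above. The ε-greedy action selection is an added assumption; the slides only say a policy π is selected:

```python
import random

def q_learning(R, step, states, actions, alpha=0.1, gamma=0.5, epsilon=0.1, steps=5000):
    """Learn Q(n, a) from sampled transitions, with no reward or transition model."""
    Q = {(n, a): 0.0 for n in states for a in actions}    # step 1: arbitrary initialization
    n = "start"                                           # step 3: start in the initial state
    for _ in range(steps):
        # Choose an action: epsilon-greedy around the current Q estimates (assumed scheme).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(n, act)])
        r = R(n)                                          # reward observed in state n
        n_next = step(n, a)                               # sample the next state
        # Step 4: blend the old estimate with the sampled backup.
        best_next = max(Q[(n_next, act)] for act in actions)
        Q[(n, a)] = (1 - alpha) * Q[(n, a)] + alpha * (r + gamma * best_next)
        n = n_next if n_next != "done" else "start"       # step 5 (restart after the final state)
    return Q
```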
