# Introduction to reinforcement learning

Reza Shadmehr


580.691 Learning Theory

## Reinforcement Learning 1: Generalized policy iteration
In a given situation, the action an organism performs can be followed by a certain outcome, which might be rewarding or non-rewarding to the organism. The action also brings the organism into a new state, from which there may again be the possibility to obtain reward.

How do we learn to choose actions so as to maximize reward? This is the problem addressed by reinforcement learning.

In contrast to supervised learning, the reward we see only tells us whether the action we chose was good or bad, not what the "correct" action would have been. Also, the actions we take and the resulting reward can be separated in time, which raises the problem of how to assign the reward signal to actions. Thus, reinforcement learning has some aspects of supervised learning, but with a very "poor" teacher.
## Reinforcement Learning: The lay of the land

Say you are a rat in a maze. At any given place you can go left or right, and you might get food as a result:

*(Figure: a maze diagram with states drawn as big circles and actions — left, right, wait — drawn as small circles.)*

The big circles are **states**. We say that the state at time $t$ is $s_t$, indicating that the rat is at a certain location.

The small circles are **actions**, which the rat can take at any given state. Actions transport the actor into a new state.

We define a **policy** $\pi$, a probabilistic mapping of states to actions:

$$\pi : s \to a, \qquad \pi(s, a) = P(a_t = a \mid s_t = s)$$
Often the outcome of an action is not fully certain. For example, when going left the rat might have a 10% chance of falling through a trap door. Once in the trap, the experimenter might free the rat with a probability of 20% on every time step:

*(Figure: the maze diagram with stochastic transitions, e.g. P = 0.9 vs. P = 0.1 for going left, and P = 0.8 vs. P = 0.2 for staying in or leaving the trap.)*

Thus, we define the transition probability of going to state $s'$ when you are in state $s$ and perform action $a$:

$$P_{ss'}^a = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$
## Reinforcement Learning: reward

*(Figure: the maze diagram with rewards attached to transitions, e.g. r = 8 for waiting, r = 6 for going left or right, and r = 1 in the trap.)*

If you are in state $s$ at time $t$ and take action $a$ that brings you to another state at time $t+1$, then you receive the reward $r_{t+1}$.

In a discrete episode, the **Return** is defined as the sum of all the rewards we get from now until time $T$:

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$

The goal of reinforcement learning is to find the policy $\pi$ that maximizes the expected return from each state. This is the optimal policy.

We can also define the expected reward:

$$\mathcal{R}_{ss'}^a = E\big[\, r_{t+1} \mid s_t = s, \; s_{t+1} = s', \; a_t = a \,\big]$$
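To make these objects concrete, the maze can be sketched as plain Python dictionaries. The state names, probabilities, and rewards below are illustrative stand-ins, not the exact values from the figure:

```python
# A minimal sketch of a maze-like MDP. "food" and "empty" are treated as
# terminal here; all numbers are made up for illustration.
P = {  # P[s][a] = list of (next_state, probability) pairs
    "start": {"left":  [("food", 0.9), ("trap", 0.1)],
              "right": [("empty", 1.0)]},
    "trap":  {"wait":  [("start", 0.2), ("trap", 0.8)]},
}
R = {  # R[s][a] = expected immediate reward for taking a in s
    "start": {"left": 6.0, "right": 1.0},
    "trap":  {"wait": 0.0},
}

# Transition probabilities out of each state-action pair must sum to 1.
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9
```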
## Reinforcement Learning: episodic and continuous environments

Now, life does not come in discrete episodes, but rather as a continuous stream of behavior. If we want to define the Return in a continuous case like this (where there is no end $T$), we need to introduce **temporal discounting**. That is, a reward we get right now is worth much more than a reward we will get tomorrow; the value of a delayed reward decreases exponentially:

$$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$$

Temporal discounting can be demonstrated in humans and animals: given the choice between 2 food pellets now and 4 food pellets in 2 hrs, which one does the rat prefer? Would you rather have \$10 now or \$12 in a month?

With temporal discounting we can rewrite each episodic environment as a continuous one, by introducing nodes for the terminal states that have a transition probability of 1 onto themselves.
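As a quick sketch, the discounted Return of a finite reward sequence can be computed directly; the reward sequence below is made up for illustration:

```python
# Discounted return: R_t = sum over k of gamma^k * r_{t+k+1}.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 6.0]                     # reward arrives two steps from now
print(discounted_return(rewards, gamma=1.0))  # undiscounted: 6.0
print(discounted_return(rewards, gamma=0.5))  # delayed reward shrinks to 0.25 * 6 = 1.5
```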
## Value function and Bellman equations

Most reinforcement learning algorithms are based on estimating the **value function**: the expected Return of a state under a certain policy.

$$
\begin{aligned}
V^{\pi}(s) &= E_{\pi}\big[ R_t \mid s_t = s \big] = E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \,\Big|\, s_t = s \Big] \\
&= E_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+2} \,\Big|\, s_t = s \Big] \\
&= \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma\, E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+2} \,\Big|\, s_{t+1} = s' \Big] \Big) \\
&= \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma\, V^{\pi}(s') \Big)
\end{aligned}
$$

This recursive definition of the value function is known as the **Bellman equation**. We can write down a Bellman equation for each state; the value function is then the unique solution to this system of equations.

Correspondingly, we can define $Q$, the **action-value function**: the expected return when performing action $a$ from state $s$ and following $\pi$ thereafter.

$$
Q^{\pi}(s, a) = E_{\pi}\big[ R_t \mid s_t = s, a_t = a \big]
= E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big]
= \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma\, V^{\pi}(s') \Big)
$$
## Evaluating policies

The first question a learner has to answer is how good the current policy is. That is, we need a method that evaluates the value function for a policy.

**Dynamic programming:**

$$
\begin{aligned}
&\text{initialize } V^{(0)}(s) = 0 \\
&\text{repeat:} \quad V^{(i+1)}(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma\, V^{(i)}(s') \Big), \quad \forall s \\
&\text{until } \max_s \big| V^{(i+1)}(s) - V^{(i)}(s) \big| < \epsilon
\end{aligned}
$$

Dynamic programming works nicely to find a solution using a simple iterative scheme. For large state spaces it can be beneficial to update only a subset of states in each iteration and then use these to update other states. BUT: in general, dynamic programming requires that we know the transition probabilities $P$ and the expected rewards $\mathcal{R}$.
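A minimal sketch of the iterative scheme, on the same invented two-state chain (the backup is written in matrix form for brevity):

```python
import numpy as np

# Iterative policy evaluation: sweep the Bellman backup until the largest
# change falls below epsilon. The dynamics are illustrative.
gamma, eps = 0.9, 1e-8
P_pi = np.array([[0.0, 1.0], [0.5, 0.5]])
R_pi = np.array([1.0, 0.0])

V = np.zeros(2)                       # initialize V(s) = 0
while True:
    V_new = R_pi + gamma * P_pi @ V   # one sweep of the Bellman backup
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new

# The fixed point agrees with the exact linear-algebra solution.
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
assert np.allclose(V, V_exact, atol=1e-6)
print(V)
```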
## Evaluating policies: experience (Monte Carlo)

Well, the learner often does not know the environment. So how can we learn the value function? Follow the policy $\pi$ for $N$ steps, keeping track of the rewards and of the visits to each state, $k(s)$. For each visit $i$ to $s$ (occurring at time $t_i$), calculate the observed discounted return

$$\hat{R}^{(t_i)} = \sum_{t = t_i}^{N} \gamma^{\,t - t_i}\, r_{t+1}$$

and average over the first $k$ visits to $s$:

$$\hat{V}(s) = \frac{1}{k} \sum_{i=1}^{k} \hat{R}^{(t_i)}$$

Experience sampling using Monte Carlo can be quite time-intensive, but does not require any knowledge of the environment.
## Optimal Value function

Now that we have a value function, we can define the optimal value function. This is the value function under the best policy $\pi^*$, so that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s$ and all $\pi$:

$$
\begin{aligned}
V^{\pi^*}(s) &= \max_{a \in A(s)} Q^{\pi^*}(s, a) \\
&= \max_a E_{\pi^*}\big[ r_{t+1} + \gamma\, V^{\pi^*}(s_{t+1}) \mid s_t = s, a_t = a \big] \\
&= \max_a \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma\, V^{\pi^*}(s') \Big)
\end{aligned}
$$

The optimal action-value function is:

$$
Q^{\pi^*}(s, a) = E\Big[ \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big]
= \sum_{s'} P_{ss'}^a \Big( \mathcal{R}_{ss'}^a + \gamma \max_{a'} Q^{\pi^*}(s', a') \Big)
$$
## Optimizing the policy

How do we find the optimal value function if we only know the value function of the current policy? The key is to realize that if we change the policy at one state $s$ from

$$\pi(s, a) \to \pi'(s, a)$$

such that $V^{\pi'}(s) \ge V^{\pi}(s)$, and otherwise follow $\pi$, then:

$$V^{\pi'}(s) \ge V^{\pi}(s), \quad \forall s$$

This is known as the **policy improvement theorem**. For example, we can improve the policy greedily, by choosing for every state $s$ the action

$$\pi'(s) = \arg\max_a Q^{\pi}(s, a)$$

We can now iterate policy evaluation and policy improvement until the policy doesn't change anymore. By the definition of the optimal value function,

$$V^{\pi^*}(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)$$

a policy that is stable under greedy improvement must be optimal. By iterating policy evaluation and policy improvement we thus find the optimal policy and value function. This is known as **generalized policy iteration**.
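The full evaluate-improve loop can be sketched as follows, on an invented two-state, two-action MDP; evaluation solves the linear Bellman system exactly, and improvement acts greedily on $Q$:

```python
import numpy as np

# Policy iteration on an illustrative MDP: P[a][s, s'] are transition
# probabilities for action a, R[a][s] the expected immediate rewards.
gamma = 0.9
P = np.array([[[1.0, 0.0], [1.0, 0.0]],    # action 0 always leads to state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1 always leads to state 1
R = np.array([[0.0, 0.0],                  # action 0 pays nothing
              [1.0, 2.0]])                 # action 1 pays 1 from s=0, 2 from s=1

policy = np.zeros(2, dtype=int)            # start with action 0 everywhere
while True:
    # Policy evaluation: solve the linear Bellman equations for V^pi.
    P_pi = P[policy, np.arange(2)]         # transition matrix under the policy
    R_pi = R[policy, np.arange(2)]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to Q(s, a).
    Q = R + gamma * P @ V                  # Q[a][s]
    new_policy = np.argmax(Q, axis=0)
    if np.array_equal(new_policy, policy): # policy stable -> optimal
        break
    policy = new_policy

print(policy, V)
```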
## Generalized policy iterations

We can alternate policy evaluations (E) and policy improvements (I) until convergence:

$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \cdots \longrightarrow \pi^*$$

*(Figure: the maze example with state values under successive policies, e.g. 4.6, 7.4, 8.8, 5.7 after the first evaluation, improving to 7.8, 14.3, 12.2, 9.8.)*
## Exploration vs. Exploitation

When the organism does not know the environment, but has to rely on sampling, a greedy policy can get in the way of finding the value function of the policy. That is, exploitation can get in the way of exploration. This will be your homework 1.

**Solution 1:** Instead of being maximally greedy, be $\epsilon$-greedy. That is, go for the maximum in $1-\epsilon$ of the cases, and choose among the other options in $\epsilon$ of the cases.

**Solution 2:** Do not go for the maximum, but choose each action with a probability given by a softmax function:

$$\pi(a_t = a \mid s_t = s) = \pi(a, s) = \frac{e^{Q(a,s)/\tau}}{\sum_b e^{Q(b,s)/\tau}}$$

Sometimes this is called the Gibbs/Boltzmann distribution. The parameter $\tau$ determines how "soft" the selection is: if $\tau \to 0$, the softmax function approaches maximally greedy selection. $\tau$ is often called the temperature of the distribution. As the temperature decreases, the distribution "crystallizes" around one point; when the temperature rises, the distribution becomes more and more diffuse.
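Both schemes are easy to sketch; the Q-values below are made up for illustration:

```python
import math, random

random.seed(1)
Q = {"left": 2.0, "right": 1.0, "wait": 0.5}   # illustrative action values

def epsilon_greedy(Q, eps):
    """With probability 1 - eps pick the greedy action, else a random one."""
    if random.random() < eps:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def softmax_probs(Q, tau):
    """Gibbs/Boltzmann action probabilities with temperature tau."""
    exps = {a: math.exp(q / tau) for a, q in Q.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

print(epsilon_greedy(Q, eps=0.1))
print(softmax_probs(Q, tau=1.0))    # moderately soft preferences
print(softmax_probs(Q, tau=0.1))    # low temperature: nearly greedy
```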
For computational purposes, we can write the policy and the transition probabilities as matrices. We borrow a formalism that is widely used for discrete stochastic processes. The state $s_t$ becomes a vector of indicator variables:

$$
\Pi = \Big[\, \pi(a_t = a \mid s_t = s) \,\Big]_{A \times S}, \qquad
A = \Big[\, P(s_{t+1} = s \mid a_t = a) \,\Big]_{S \times A}
$$

$$
\mathbf{s}_t = \begin{pmatrix} I(s_t = 1) \\ \vdots \\ I(s_t = S) \end{pmatrix}, \qquad
p(\mathbf{s}_{t+1} \mid \pi, \mathbf{s}_t) = A\,\Pi\,\mathbf{s}_t
$$

To do this we have to be careful about how we define our actions. The probability of a transition to $s'$ must ONLY depend on the action taken, not on the last state. That is, when we have 5 states and can go left or right from each of them, we need to define 10 actions (go left from state 1, etc.).
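A numerical sketch of one propagation step, with invented two-state, two-action matrices:

```python
import numpy as np

# One step of the chain in matrix form: the distribution over s_{t+1} is
# A @ Pi @ s_t. Pi (actions x states) holds pi(a | s); A (states x actions)
# holds P(s' | a). Numbers are illustrative; every column of a stochastic
# matrix must sum to 1.
Pi = np.array([[0.7, 0.2],    # P(a = left  | s)
               [0.3, 0.8]])   # P(a = right | s)
A = np.array([[0.9, 0.1],     # P(s' = 0 | a)
              [0.1, 0.9]])    # P(s' = 1 | a)

s_t = np.array([1.0, 0.0])    # indicator vector: currently in state 0
p_next = A @ Pi @ s_t         # distribution over s_{t+1}
print(p_next)
assert abs(p_next.sum() - 1.0) < 1e-9
```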
