Reinforcement Learning

Yishay Mansour
Tel-Aviv University
Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning

Goal of Reinforcement Learning
Goal-oriented learning through interaction

Control of large-scale stochastic environments with partial knowledge.

Supervised / Unsupervised Learning
Learn from labeled / unlabeled examples

Reinforcement Learning - origins
Artificial Intelligence

Control Theory

Operations Research

Cognitive Science & Psychology

Solid foundations; well-established research.
Typical Applications
• Robotics
– Elevator control [CB].
– Robo-soccer [SV].
• Board games
– Backgammon [T].
– Checkers [S].
– Chess [B].
• Scheduling
– Dynamic channel allocation [SB].
– Inventory problems.
Contrast with Supervised Learning

The system has a “state”.

The algorithm influences the state distribution.

Mathematical Model - Motivation

Model of uncertainty:
Environment, actions, our knowledge.

Focus on decision making.

Maximize long term reward.

Markov Decision Process (MDP)
Mathematical Model - MDP

Markov decision processes

S - set of states

A - set of actions

δ - transition probability

R - reward function

(Similar to a DFA!)
MDP model - states and actions
Environment = states

Actions = transitions, with probabilities δ(s, a, s′)

[Diagram: action a leads from a state to two possible next states, with probabilities 0.7 and 0.3.]
MDP model - rewards

R(s,a) = reward at state s
for doing action a
(a random variable).

Example:
R(s,a) = −1 with probability 0.5
         +10 with probability 0.35
         +20 with probability 0.15
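As a sanity check on such a random reward, here is a minimal Python sketch (the names and the helper are mine, not the lecture's) that encodes this distribution and computes its expectation:

```python
import random

# The reward distribution R(s, a) from the example: value -> probability.
reward_dist = {-1: 0.5, 10: 0.35, 20: 0.15}

def sample_reward():
    """Draw one realization of the random reward R(s, a)."""
    values = list(reward_dist)
    return random.choices(values, weights=[reward_dist[v] for v in values])[0]

# Expected immediate reward: -1*0.5 + 10*0.35 + 20*0.15 = 6.0
print(sum(v * p for v, p in reward_dist.items()))
```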
MDP model - trajectories

trajectory:
s0, a0, r0, s1, a1, r1, s2, a2, r2, …
MDP - Return function.

Combining all the immediate rewards into a single value.

Modeling Issues:

Are early rewards more valuable than later rewards?

Is the system “terminating” or continuous?

Usually the return is linear in the immediate rewards.
MDP model - return functions
Finite Horizon - parameter H:
return = Σ_{i=1..H} R(s_i, a_i)

Infinite Horizon, discounted - parameter γ < 1:
return = Σ_{i≥0} γ^i R(s_i, a_i)

Infinite Horizon, undiscounted:
return = lim_{N→∞} (1/N) Σ_{i=0..N−1} R(s_i, a_i)

Terminating MDP
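For a concrete feel, a small sketch of the three return functions on a finite reward sequence (the infinite sums are truncated; function names are my own):

```python
def finite_horizon_return(rewards, H):
    """Sum of the first H immediate rewards."""
    return sum(rewards[:H])

def discounted_return(rewards, gamma):
    """Sum of gamma^i * r_i; truncating approximates the infinite sum when gamma < 1."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def average_return(rewards):
    """Undiscounted average: (1/N) * sum of the first N rewards."""
    return sum(rewards) / len(rewards)

rewards = [1, 0, 2, 1]
print(finite_horizon_return(rewards, 2))  # 1
print(discounted_return(rewards, 0.5))    # 1 + 0 + 0.5 + 0.125 = 1.625
print(average_return(rewards))            # 1.0
```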
MDP model - action selection
AIM: Maximize the expected return.

Fully Observable - can “see” the “entire” state.

Policy - mapping from states to actions

Optimal policy: optimal from any start state.

THEOREM: There exists a deterministic optimal policy
Contrast with Supervised Learning

Supervised Learning:
Fixed distribution on examples.

Reinforcement Learning:
The state distribution is policy dependent!!!

A small local change in the policy can make a huge
global change in the return.

MDP model - summary

sS      - set of states, |S|=n.
a A     - set of k actions, |A|=k.
d ( s1 , a, s2 ) - transition function.

R(s,a)          - immediate reward function.
 :S  A           - policy.


g
i0
i
ri   - discounted cumulative return.

16
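One possible in-code representation of this tuple; a minimal sketch where the container and field names are my own convention, not the lecture's:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int

@dataclass
class MDP:
    states: List[State]     # S, |S| = n
    actions: List[Action]   # A, |A| = k
    delta: Dict[Tuple[State, Action], Dict[State, float]]  # (s1, a) -> {s2: prob}
    reward: Dict[Tuple[State, Action], float]              # expected R(s, a)
    gamma: float            # discount factor, gamma < 1

# A deterministic policy pi : S -> A is then just a mapping.
Policy = Dict[State, Action]
```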
Simple example: N-armed bandit

Single state. Goal: maximize the sum of immediate rewards.

Given the model: the greedy action.

Difficulty: unknown model.

[Diagram: one state s with arms a1, a2, a3.]
N-Armed Bandit: Highlights
• Algorithms (near greedy):
– Exponential weights
• G_i = sum of rewards of action a_i
• w_i = e^{G_i}
• Results:
– For any sequence of T rewards:
E[online] ≥ max_i {G_i} − √(T log N)
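A sketch of the exponential-weights idea in the full-information setting (the learning rate eta and its tuning are my assumptions; the slide's w_i = e^{G_i} corresponds to eta = 1, and a bandit-feedback version would need importance-weighted reward estimates, as in EXP3):

```python
import math
import random

def exponential_weights(n_actions, T, get_rewards, eta=None):
    """Play T rounds, choosing action i with probability proportional to e^{eta * G_i}.

    get_rewards(t) must return the reward of every action at round t.
    """
    if eta is None:
        eta = math.sqrt(math.log(n_actions) / T)  # standard tuning (assumed)
    G = [0.0] * n_actions   # G_i: cumulative reward of action a_i
    online = 0.0            # reward collected by the algorithm
    for t in range(T):
        m = max(G)          # subtract the max for numerical stability
        weights = [math.exp(eta * (g - m)) for g in G]
        i = random.choices(range(n_actions), weights=weights)[0]
        r = get_rewards(t)
        online += r[i]
        G = [g + x for g, x in zip(G, r)]
    return online, G

# Toy usage: two actions with fixed rewards; the algorithm should track action 1.
total, G = exponential_weights(2, 1000, lambda t: [0.4, 0.6])
```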
Planning - Basic Problems.

Given a complete MDP model.

Policy evaluation - given a policy π, estimate its return.

Optimal control - find an optimal policy π* (one that maximizes the return from any start state).
Planning - Value Functions

V(s) The expected return starting at state s following .

Q(s,a) The expected return starting at state s with
action a and then following .

V*(s) and Q*(s,a) are define using an optimal policy *.

V*(s) = max V(s)
20
Planning - Policy Evaluation
Discounted infinite horizon (Bellman Eq.):

V^π(s) = E_{s′ ~ δ(s, π(s), ·)} [ R(s, π(s)) + γ V^π(s′) ]

Rewriting the expectation:

V^π(s) = E[R(s, π(s))] + γ Σ_{s′} δ(s, π(s), s′) V^π(s′)

A linear system of equations.
Algorithms - Policy Evaluation
Example

A = {+1, −1}
γ = 1/2
δ(s_i, a) = s_{i+a}
π random
∀a: R(s_i, a) = i

[Diagram: four states s0, s1, s2, s3 on a cycle, with reward i at state s_i.]

V^π(s0) = 0 + γ [ π(+1|s0) V^π(s1) + π(−1|s0) V^π(s3) ]
Algorithms - Policy Evaluation
Example

A = {+1, −1}
γ = 1/2
δ(s_i, a) = s_{i+a}
π random
∀a: R(s_i, a) = i

[Same four-state cycle as before.]

V^π(s0) = 5/3
V^π(s1) = 7/3
V^π(s2) = 11/3
V^π(s3) = 13/3

V^π(s0) = 0 + (V^π(s1) + V^π(s3)) / 4
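These values can be checked by iterating the Bellman operator for the random policy; a minimal sketch of this example (pure Python, no solver library):

```python
# Policy evaluation for the 4-state cycle: R(s_i, a) = i, gamma = 1/2,
# and the random policy moves to s_{i+1} or s_{i-1} (mod 4) with probability 1/2.
gamma = 0.5
V = [0.0] * 4
for _ in range(100):  # V <- R + gamma * P_pi V converges since gamma < 1
    V = [i + gamma * 0.5 * (V[(i + 1) % 4] + V[(i - 1) % 4]) for i in range(4)]
print([round(v, 4) for v in V])  # [1.6667, 2.3333, 3.6667, 4.3333] = [5/3, 7/3, 11/3, 13/3]
```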
Algorithms - optimal control
State-Action Value function:

Q(s,a)  E [ R(s,a)] + gEs’~ (s,a) [ V(s’)]

Note    V  ( s )  Q  ( s ,  ( s ))

For a deterministic policy .

24
Algorithms - Optimal control
Example

A = {+1, −1}
γ = 1/2
δ(s_i, a) = s_{i+a}
π random
∀a: R(s_i, a) = i

[Same four-state cycle as before.]

Q^π(s0, +1) = 7/6
Q^π(s0, −1) = 13/6

Q^π(s0, +1) = 0 + γ V^π(s1)
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:

V^π(s) = max_a {Q^π(s,a)}               (Bellman Eq.)

PROOF: Assume there is a state s and an action a such that V^π(s) < Q^π(s,a).
Then the strategy of performing a at state s (the first time) is better than π.
This is true each time we visit s, so the policy that always performs action a at state s is better than π.
Algorithms - optimal control
Example

A = {+1, −1}
γ = 1/2
δ(s_i, a) = s_{i+a}
π random
∀a: R(s_i, a) = i

[Same four-state cycle as before.]

Changing the policy using the state-action value function.
Algorithms - optimal control

The greedy policy with respect to Q^π(s,a) is

π(s) = argmax_a {Q^π(s,a)}

The ε-greedy policy with respect to Q^π(s,a) is

π(s) = argmax_a {Q^π(s,a)} with probability 1−ε, and
π(s) = a random action with probability ε.
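Both rules in code; a minimal sketch with Q stored as a dict of dicts (names are mine):

```python
import random

def greedy(Q, s):
    """The greedy action: argmax_a Q(s, a)."""
    return max(Q[s], key=Q[s].get)

def epsilon_greedy(Q, s, eps):
    """With probability 1 - eps act greedily; otherwise pick a uniformly random action."""
    if random.random() < eps:
        return random.choice(list(Q[s]))
    return greedy(Q, s)

# Toy usage: in state 's', action 'b' has the larger value.
Q = {"s": {"a": 1.0, "b": 2.0}}
print(greedy(Q, "s"))               # 'b'
print(epsilon_greedy(Q, "s", 0.1))  # usually 'b', sometimes random
```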
MDP - computing optimal policy

1. Linear Programming

2. Value Iteration method:

V^{i+1}(s) = max_a { R(s,a) + γ Σ_{s′} δ(s,a,s′) V^i(s′) }

3. Policy Iteration method:

π_{i+1}(s) = argmax_a { Q^{π_i}(s,a) }
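A sketch of value iteration, plus the greedy-policy step that policy iteration repeats, using the same dictionary shapes as the container sketched after the MDP summary (all names and shapes are my assumptions):

```python
def value_iteration(states, actions, delta, reward, gamma, tol=1e-8):
    """Iterate V(s) <- max_a { R(s,a) + gamma * sum_s' delta(s,a,s') V(s') }."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                reward[(s, a)]
                + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def greedy_policy(states, actions, delta, reward, gamma, V):
    """Policy iteration's improvement step: pi(s) = argmax_a Q(s, a)."""
    def Q(s, a):
        return reward[(s, a)] + gamma * sum(
            p * V[s2] for s2, p in delta[(s, a)].items())
    return {s: max(actions, key=lambda a, s=s: Q(s, a)) for s in states}

# The 4-state cycle from the earlier examples:
states, actions, gamma = [0, 1, 2, 3], [+1, -1], 0.5
delta = {(s, a): {(s + a) % 4: 1.0} for s in states for a in actions}
reward = {(s, a): float(s) for s in states for a in actions}
V_star = value_iteration(states, actions, delta, reward, gamma)
pi_star = greedy_policy(states, actions, delta, reward, gamma, V_star)
```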
Convergence
• Value Iteration
– The distance from the optimum, max_s {V*(s) − V^t(s)}, shrinks with each iteration.
• Policy Iteration
– The policy can only improve: ∀s: V^{t+1}(s) ≥ V^t(s)
– Fewer iterations than Value Iteration, but more expensive iterations.
Relations to Board Games
•   state = current board
•   action = what we can play.
•   opponent action = part of the environment
•   value function = probability of winning
•   Q-function = modified policy.
•   Hidden assumption: Game is Markovian

Planning versus Learning

Tightly coupled in Reinforcement Learning

Goal: maximize return while learning.

Example - Elevator Control
Learning (alone):
model the arrival process well.

Planning (alone):
given the arrival model, build a schedule.

Real objective: construct a schedule while updating the model.
Partially Observable MDP
Rather than observing the state, we observe some function of the state.

Ob - observation function: a random variable for each state.

Example: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.

Problem: different states may “look” similar.

The optimal strategy is history-dependent!
POMDP - Belief State Algorithm
Given a history of actions and observed values, we compute a posterior distribution over the state we are in (the belief state).

The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transition: the posterior distribution (given the observation).

We can perform the planning and learning on the belief-state MDP.
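A sketch of the belief update itself, by Bayes' rule (all names mine; obs_prob plays the role of the observation function Ob):

```python
def update_belief(belief, a, o, delta, obs_prob):
    """Posterior over states after taking action a and observing o.

    belief:   {state: probability}
    delta:    (s, a) -> {s2: transition probability}
    obs_prob: obs_prob(s, o) = probability of observing o in state s
    """
    # Predict: push the current belief through the transition function.
    predicted = {}
    for s, b in belief.items():
        for s2, p in delta[(s, a)].items():
            predicted[s2] = predicted.get(s2, 0.0) + b * p
    # Correct: weight by the observation likelihood and renormalize
    # (assumes the observation o has nonzero probability under 'predicted').
    posterior = {s: p * obs_prob(s, o) for s, p in predicted.items()}
    z = sum(posterior.values())
    return {s: p / z for s, p in posterior.items()}
```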
POMDP - Hard Computational Problems

Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].

Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].

Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
Resources
• Reinforcement Learning (an introduction)
[Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal
Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas &
Tsitsiklis]
• Ph.D. thesis of Michael Littman
