# Reinforcement Learning, Recurrent Networks, and Markov Decision Processes

Lecture 8, 2011-03-02. Neural Networks and Learning Systems (TBMI26). Magnus Borga.


## Markov Decision Process

The next state depends on the current state (input) and on the output of the system:

$$x_{t+1} = f(x_t, a_t, e_t)$$

That is, where the system ends up ($x_{t+1}$) depends on where it is ($x_t$) and what it does ($a_t$).

[Figure: the learning system maps the state $x$ to an action $a$; the environment closes the loop with $x_{t+1} = f(x_t, a_t, e_t)$ and $a_t = \mu(x_t)$.]

Since $a_t = \mu(x_t)$, this is actually an ordinary Markov process.
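A minimal sketch of this loop in Python; the dynamics `f`, the policy `mu`, and the disturbance `e` below are illustrative stand-ins, not the lecture's example:

```python
import random

# Minimal sketch of the state -> action -> next-state loop above.
# f, mu, and e_t are hypothetical stand-ins, not the lecture's example.
def f(x, a, e):
    """Next state from current state, action, and environment disturbance."""
    return x + a + e

def mu(x):
    """A fixed (stationary) policy: the action depends only on the state."""
    return -1 if x > 0 else 1

x = 5
for t in range(10):
    a = mu(x)                      # a_t = mu(x_t)
    e = random.choice([-1, 0, 1])  # e_t: random influence from the environment
    x = f(x, a, e)                 # x_{t+1} = f(x_t, a_t, e_t)
    print(t, x, a)
```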

## Cost

$g(x, a)$ is the cost for taking action $a$ in state $x$.

[Figure: a small example graph whose transitions have costs $g = 1$, $g = 0.1$, $g = 1$, and $g = 0.5$.]

## The value function

The value of a state $x$, given a certain policy $\mu$, is the accumulated cost:

$$J^{\mu}(x_t) = \sum_{i=0}^{\infty} \gamma^{i}\, g(x_{t+i}, a_{t+i})$$

Here $\gamma$ is a "discount factor" that makes costs further in the future count less, with $0 < \gamma \le 1$.
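As a quick numerical illustration, the infinite sum can be truncated; the cost sequence below is made up, and the sum converges because $\gamma < 1$:

```python
# Sketch: approximate J^mu(x_t) by truncating the infinite discounted sum.
# The cost sequence is made up; costs[i] plays the role of g(x_{t+i}, a_{t+i}).
def accumulated_cost(costs, gamma=0.9):
    return sum(gamma**i * g for i, g in enumerate(costs))

print(accumulated_cost([1.0, 0.1, 0.5]))  # 1.0 + 0.9*0.1 + 0.81*0.5 = 1.495
```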


## Accumulated cost

[Figure: the example graph annotated with accumulated costs $J(x)$ for $\gamma = 0.9$: values 1, 0, and 1.4 are shown and one value is marked "?"; transition costs $g = 1$, $g = 0.1$, $g = 1$, $g = 0.5$.]

## Dynamic programming

Given a Markov decision process, find a (stationary) policy $\mu$ that minimizes the accumulated cost $J$ for all initial states $x_0$.

## Optimal policy

The optimal policy $\mu^*$ gives the smallest cost:

$$J^{\mu^*}(x) = J^{*}(x) \le J^{\mu}(x) \quad \forall\, x, \mu$$

Splitting off the first term of the accumulated cost gives a recursion:

$$J^{\mu}(x_t) = \sum_{i=0}^{\infty} \gamma^{i} g_{t+i} = g_t + \sum_{i=1}^{\infty} \gamma^{i} g_{t+i} = g_t + \gamma J^{\mu}(x_{t+1})$$

For the optimal policy this becomes Bellman's optimality equation:

$$J^{*}(x_t) = g(x_t, \mu^{*}(x_t)) + \gamma J^{*}(x_{t+1}) = \min_{a}\{\, g(x_t, a_t) + \gamma J^{*}(x_{t+1}) \,\}$$

## Richard Bellman

Richard Bellman (1920–1984) was an applied mathematician, celebrated for his invention of dynamic programming in 1953 and for important contributions in other fields of mathematics.
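Bellman's optimality equation above suggests a simple fixed-point scheme: repeatedly replace $J(x)$ by the right-hand-side minimum. A minimal value-iteration sketch on a made-up three-state problem (none of the numbers come from the lecture):

```python
# Value iteration: iterate J(x) <- min_a { g(x,a) + gamma * J(f(x,a)) } to a
# fixed point. States, dynamics, and costs are illustrative, not the lecture's.
states, actions, gamma = [0, 1, 2], [0, 1], 0.9

def f(x, a):
    """Deterministic toy dynamics: action 1 moves right, action 0 stays."""
    return min(x + a, 2)

def g(x, a):
    """Unit cost per step everywhere except the goal state 2."""
    return 0.0 if x == 2 else 1.0

J = {x: 0.0 for x in states}
for _ in range(100):
    J = {x: min(g(x, a) + gamma * J[f(x, a)] for a in actions) for x in states}

# The optimal (stationary) policy follows by taking the minimizing action:
mu_star = {x: min(actions, key=lambda a: g(x, a) + gamma * J[f(x, a)])
           for x in states}
print(J, mu_star)   # J ~ {0: 1.9, 1: 1.0, 2: 0.0}; mu* moves towards the goal
```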

## Dynamic programming

[Figure: the example graph solved by dynamic programming ($\gamma = 0.9$): accumulated costs $J^{\mu}(x)$ of 1.36, 1, 1.4, and 0; transition costs $g = 1$, $g = 0.1$, $g = 1$, $g = 0.5$.]

## Reinforcement learning

- An on-line version of dynamic programming.
- The states, possible actions, and corresponding costs turn up during learning.
- Also called "neurodynamic programming".


## Reinforcement learning

[Figure: the agent-environment loop again, now with the scalar cost $g(x, a)$ fed back to the learning system together with the state; $x_{t+1} = f(x_t, a_t, e_t)$ and $a_t = \mu(x_t)$.]

## Applications

- Active systems interacting with the environment, e.g. a robot.
- Optimization of unknown cost functions, e.g. routing.
- Can become better than the teacher!

## Reinforcement learning

The task is defined by a scalar function, $g(x, a)$, i.e. the cost of taking action $a$ in state $x$. The system's goal is to minimize $g$ over time:

$$J = \sum_{t=0}^{\infty} \gamma^{t} g(t)$$

- It is often difficult to tell how a problem should be solved, but easy to tell whether, and how well, it has been solved.
- More general than supervised learning.
- The system must find a solution by itself!
- "Learning by doing" or "trial and error".

"We learn as we do and we do as well as we have learned."

## Reinforcement learning

[Figure: the example graph as learned by reinforcement learning ($\gamma = 0.9$): estimated costs $J^{\mu}(x)$ of 1.36 and 1.9 alongside the state values 1, 0, and 1.4; transition costs $g = 1$, $g = 0.1$, $g = 1$, $g = 0.5$.]

## Problem

"The system learned to properly land the aircraft with a rate of success of 90% to 96% after some 60,000 attempts."


## MENACE

MENACE, the Match-box Educable Noughts And Crosses Engine:

- One box for each position.
- Each box is filled with beans of different colours.
- Each colour represents a certain move.
- Play by drawing beans and moving according to the colours.
- If the system wins, add new beans with the same colours as the ones drawn.
- If the system loses, remove the drawn beans.

## Q-learning

$$Q^{\mu}(x_t, a_t) = g(x_t, a_t) + \gamma J^{\mu}(x_{t+1})$$

$Q^{\mu}(x, a)$ is the cost of first taking action $a$ and then following the policy $\mu$.

$$J^{*}(x_t) = \min_{a}\{\, g(x_t, a_t) + \gamma J^{*}(x_{t+1}) \,\} = \min_{a}\{\, Q^{*}(x_t, a_t) \,\}$$

Now we don't need a model of the environment ($x_{t+1} = f(x_t, a_t)$) to find the optimal response!
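To see the practical difference: with only $J^{*}$, picking an action requires the model $f$ to evaluate $J^{*}(x_{t+1})$, while with $Q^{*}$ the greedy choice is just a table lookup. A tiny sketch with hypothetical Q-values:

```python
# With J* alone, choosing an action needs the model f to look up J*(x_{t+1});
# with Q*, the best action is simply the minimizing table entry.
Q_star = {(0, 1): 1.9, (0, 2): 1.36}   # hypothetical Q-values for one state

def best_action(x, actions=(1, 2)):
    return min(actions, key=lambda a: Q_star[(x, a)])

print(best_action(0))   # -> 2, since 1.36 < 1.9
```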

## Q-learning

Since $J^{\mu}(x_{t+1}) = Q^{\mu}(x_{t+1}, a_{t+1})$ when $a_{t+1}$ is chosen by the policy, the recursion

$$Q^{\mu}(x_t, a_t) = g(x_t, a_t) + \gamma J^{\mu}(x_{t+1})$$

can be written entirely in terms of $Q$:

$$Q^{\mu}(x_t, a_t) = g(x_t, a_t) + \gamma Q^{\mu}(x_{t+1}, a_{t+1})$$

Update the estimate of $Q$ with

$$\Delta Q_t = \eta \left( g(x_t, a_t) + \gamma Q^{\mu}(x_{t+1}, a_{t+1}) - Q^{\mu}(x_t, a_t) \right)$$

where the terms in front of the minus sign are what $Q$ should be and the last term is what $Q$ is now.

- Think of the Q-function as a table with a value for each possible action in each possible state (a tabular sketch follows below).
- The Q-function can, however, be implemented as a neural network, learning a continuous representation from the sampled states and actions.
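Here is a tabular sketch of the update rule, with the next action $a_{t+1}$ drawn from the current policy exactly as the formula is written; if the greedy (minimizing) action is used instead, this becomes the more common form of Q-learning. The chain environment, constants, and episode counts are all assumptions for illustration:

```python
import random

# Tabular sketch of the Delta-Q update above, on a hypothetical 4-state chain
# (none of the environment details below come from the lecture).
N, ACTIONS, GAMMA, ETA, EPS = 4, (0, 1), 0.9, 0.1, 0.1

Q = {(x, a): 0.0 for x in range(N) for a in ACTIONS}

def step(x, a):
    """Toy environment: unit cost per step until the goal state N-1."""
    return min(x + a, N - 1), (0.0 if x == N - 1 else 1.0)

def policy(x):
    """Mostly exploit the table, explore with probability EPS (see below)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return min(ACTIONS, key=lambda a: Q[(x, a)])

for episode in range(500):
    x, a = 0, policy(0)
    for _ in range(20):
        x_next, g = step(x, a)
        a_next = policy(x_next)
        # Delta-Q_t = eta * ("what Q should be" - "what Q is now")
        Q[(x, a)] += ETA * (g + GAMMA * Q[(x_next, a_next)] - Q[(x, a)])
        x, a = x_next, a_next
```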

## Q-learning

The learnt Q-values for the first state of the example graph ($\gamma = 0.9$):

| $a$ | $Q$  |
|-----|------|
| 1   | 1.9  |
| 2   | 1.36 |

[Figure: the example graph with actions 1 and 2 leaving the first state; state values $J^{\mu}(x)$ of 1, 0, and 1.4; transition costs $g = 1$, $g = 0.1$, $g = 1$, $g = 0.5$.]

## The exploration-exploitation dilemma

- We want to use the safest strategy, to minimize the accumulated cost.
- We want to try new strategies, in order to find a better one.
- Conflict between exploring the state space and exploiting the learnt policy (a common compromise is sketched below).
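One common compromise, offered here as an assumption since the lecture does not prescribe a method, is an ε-greedy policy: exploit the learnt Q-values most of the time, but explore with a small probability ε:

```python
import random

def epsilon_greedy(Q, x, actions, eps):
    """Explore with probability eps; otherwise exploit the learnt Q-table."""
    if random.random() < eps:
        return random.choice(list(actions))       # explore: try something new
    return min(actions, key=lambda a: Q[(x, a)])  # exploit: cheapest known action

# eps is commonly decayed so that early episodes explore and later ones exploit,
# e.g. eps_t = max(0.01, 0.5 * 0.99 ** t).
```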


## The credit assignment problem

- Structural: what part is responsible?
- Temporal: when was the crucial action taken?

## The temporal credit assignment problem

- Q-learning only learns one step at a time.
- Can the learning speed be increased?
- If the system remembers a sequence of states, the cost for the whole sequence can be updated! (A sketch follows below.)
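A sketch of one way to realize this, assuming the system stores the episode's transitions; this is in the spirit of the slide, not necessarily the lecture's exact scheme. Sweeping backwards means every update uses an already-updated successor value, so cost propagates through the whole sequence at once:

```python
# Sketch (assumed detail): update a whole remembered sequence in one pass.
def update_sequence(Q, trajectory, gamma=0.9, eta=0.1):
    """trajectory: ordered list of (x, a, g, x_next, a_next) transitions."""
    for (x, a, g, x_next, a_next) in reversed(trajectory):
        target = g + gamma * Q[(x_next, a_next)]   # what Q should be
        Q[(x, a)] += eta * (target - Q[(x, a)])    # move towards the target
```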
