
                 Hierarchical Relational Reinforcement Learning

                               Meg Aycinena
                     Department of Computer Science
                           Stanford University
                       Stanford, California, 94305
                        aycinena@cs.stanford.edu

                               Abstract

Experiments with relational reinforcement learning, a combination of reinforcement learning and inductive logic programming, are presented. This technique offers greater expressive power than that offered by traditional reinforcement learning. We use it to find general solutions in the blocks world domain. We discuss some difficulties associated with relational reinforcement learning, specifically its decreased effectiveness when presented with more complex problems. Finally, we explore ways in which hierarchical methods and learning subroutines may be able to overcome some of these weaknesses.

1. Introduction

Reinforcement learning can be a powerful machine learning technique, but it suffers from an inability to generalize well over large state spaces, or to reuse the results of one learning session on a similar, but slightly different or more complicated, problem. Džeroski et al. [Džeroski et al., 1998], [Džeroski et al., 2001] have developed a way to combine reinforcement learning with inductive logic programming, producing a more flexible, expressive approach that circumvents many of the problems that have plagued traditional reinforcement learning. Specifically, the method modifies the traditional Q-learning technique so that it stores the Q-values of state-action pairs in a top-down induced logical decision tree. The method allows the use of variables, eliminates the need for an explicit, memory-expensive state-action table, and generalizes already learned knowledge to unseen parts of the state space.

Although relational reinforcement learning greatly improves upon traditional reinforcement learning, it tends to work less effectively with more complicated goals. Its problem solving ability in the classic blocks world domain, for example, is impressive with simple goals such as clearing one block, or placing one block on top of another. However, when faced with a more complex goal such as building an ordered tower of blocks, relational reinforcement learning becomes markedly less effective.

The introduction of a hierarchical learning strategy may harness the effectiveness of relational reinforcement learning in solving simpler problems to the solution of more complicated goals. Our hypothesis states that by first learning a set of useful subroutines and then learning to combine these subroutines to achieve a stated goal, an agent may potentially converge more consistently and more quickly upon an optimal solution.

Some work has been done on hierarchical learning in traditional reinforcement learning [Kaelbling et al., 1996], but it has been hindered by the lack of variablized solutions to subproblems and the inability to generalize those solutions to slightly different problems. Because relational reinforcement learning can produce variablized solutions to subproblems, and because it has been shown to perform well on simpler problems, hierarchical learning is easier to implement and possibly more powerful in this setting than in its traditional counterpart. Hierarchical relational reinforcement learning promises to exploit the advantages and overcome some of the weaknesses of relational reinforcement learning, in order to improve the actual process of learning as well as the applicability of the learned solution. In this paper, we investigate whether this actually occurs.

The paper is organized as follows. In Section 2 we describe traditional reinforcement learning, Q-learning, and how P-learning extends it. In Section 3 we briefly summarize inductive logic programming and describe top-down induction of logical decision trees, as implemented in the TILDE system [Blockeel, 1998].
In Section 4 we describe relational reinforcement learning, and how we have implemented it according to the design of [Džeroski et al., 2001]. Section 5 describes a simple implementation strategy for hierarchical relational reinforcement learning, and Section 6 briefly offers the results of the preliminary experiments we have conducted in order to explore its effectiveness. Section 7 concludes and discusses possible further work.

2. Traditional reinforcement learning, Q-learning, and P-learning

The term "reinforcement learning" and some of the general characteristics of the technique have been borrowed from cognitive psychology, but in recent years it has become increasingly popular within the computer science disciplines of artificial intelligence and machine learning. See [Kaelbling et al., 1996] for a broad survey of the field as it has been applied within computer science. In its simplest form, reinforcement learning describes any machine learning technique in which a system learns a policy to achieve a goal through a series of trial-and-error training sessions, receiving a reward or punishment after each session, and learning from this "reinforcement" in future sessions. (Note: we shall use the terms "training session" and "episode" interchangeably to refer to one cycle of state-action sequences culminating in a goal state.)

Formally, [Kaelbling et al., 1996] describe the reinforcement learning problem as follows:

    •  a discrete set of environment states, S,
    •  a discrete set of agent actions, A, and
    •  a set of scalar reinforcement signals, typically {0, 1}, or the real numbers.

The goal of the agent is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. We will denote the optimal policy as π*. Various techniques exist within reinforcement learning for choosing an action to execute in a particular state. Q-learning is one of the more traditional of these techniques.

Q-learning

Q-learning falls within the "model-free" branch of the reinforcement learning family, because it does not require the agent to learn a model of the world or environment (i.e. how actions lead from state to state, and how states give rewards) in order to learn an optimal policy. Instead, the agent interacts with the environment by executing actions and perceiving the results they produce. In Q-learning, each possible state-action pair is assigned a quality value, or "Q-value", and actions are chosen by selecting the action in a particular state which has the highest Q-value.

The process works as follows. The agent can be in one state of the environment at a time. In a given state si ∈ S, the agent selects an action ai ∈ A to execute according to its policy π. Executing this action puts the agent in a new state si+1, and it receives the reward ri+1 associated with the new state. The policy in Q-learning is given by:

    π(s) = arg maxa Q(s, a),

and thus by learning an optimal function Q*(s, a) such that we maximize the total reward r, we can learn the optimal policy π*(s). We update Q(s, a) after taking each action so that it better approximates the optimal function Q*(s, a). We do this with the following equation:

    Q(si, ai) := ri + γ maxa' Q(si+1, a'),

where γ is the discount factor, 0 < γ < 1, and a' ranges over the actions available in the new state si+1.
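As a concrete illustration, the update can be written as a few lines of code over an explicit table (a minimal Python sketch under the table-based representation; relational reinforcement learning will later replace this table with a tree):

    # Minimal tabular Q-learning update, following the equation above.
    # Q is a dict mapping (state, action) pairs to real values.
    def q_update(Q, state, action, reward, next_state, next_actions, gamma=0.9):
        # Best value obtainable from the successor state; 0.0 if nothing is known yet.
        best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
        Q[(state, action)] = reward + gamma * best_next
        return Q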
In the first few training sessions, the agent has not yet learned even a crude approximation of the optimal Q-function, and therefore the initial actions will be random if the initial Q-values are. However, due to this randomness, the danger exists that, although the Q-function may eventually converge to a solution to the problem, it may not find the optimal solution, because the initial path it stumbles upon may be a correct but sub-optimal path to the goal. Therefore it is necessary that every reinforcement learning agent have some policy of exploration of its environment, so that, if a better solution exists than the one it has already found, it may stumble upon it.

One popular method (e.g. [Džeroski et al., 2001]) for arbitrating the tradeoff between exploitation (using knowledge already learned) and exploration (acting randomly in the hopes of learning new knowledge) is to choose each action with a probability related to its Q-value. Specifically:

    Pr(ai | s) = T^(-Q(s, ai)) / Σj T^(-Q(s, aj)).
The temperature, T, starts at a high value and is decreased as learning goes on. At the beginning, the high value of T ensures that some actions associated with low Q-values are taken – encouraging exploration. As the temperature is decreased, most actions are those associated with high Q-values – encouraging exploitation. This is the method of choosing actions that we will use.
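A minimal Python sketch of this selection rule, implementing the probability formula exactly as given above (the temperature schedule itself is described in Section 6):

    import random

    # Choose an action with probability Pr(a_i | s) = T^(-Q(s, a_i)) / sum_j T^(-Q(s, a_j)).
    def select_action(Q, state, actions, temperature):
        weights = [temperature ** (-Q.get((state, a), 0.0)) for a in actions]
        total = sum(weights)
        # random.choices samples one action according to the computed distribution.
        return random.choices(actions, weights=[w / total for w in weights], k=1)[0]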
P-learning

Like Q-learning, P-learning is a technique for choosing actions in a particular state, presented by [Džeroski et al., 2001]. Indeed, P-learning is almost identical to Q-learning, except that rather than remembering a real-valued scalar to represent the quality of each state-action pair, the agent only remembers a 1 if the action in the given state is optimal, and a 0 if the action is nonoptimal. Formally, the optimal P-function P* is defined by:

    if a ∈ π*(s) then P*(s, a) = 1 else P*(s, a) = 0.

In practice, the P function is computed from the Q function (which must also be learned), but it can be stored more compactly and used later more easily. The P function is expressed in terms of the Q function as follows:

    if a ∈ arg maxa' Q(s, a') then P(s, a) = 1 else P(s, a) = 0.
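In other words, once a Q-function is available, the P-values for a state can be read off directly; a minimal sketch (reusing the dictionary-based Q of the earlier sketch):

    # Label each action 1 ("optimal") exactly when its Q-value is maximal in this state.
    # A small tolerance could be used instead of exact equality for real-valued estimates.
    def p_values(Q, state, actions):
        best = max(Q.get((state, a), 0.0) for a in actions)
        return {a: (1 if Q.get((state, a), 0.0) == best else 0) for a in actions}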
We can also incorporate exploration into P-learning through the equation analogous to the one for choosing actions in Q-learning:

    Pr(ai | s) = T^(-P(s, ai)) / Σj T^(-P(s, aj)).

We will see that P-learning, although it is built directly on top of Q-learning and may seem rather redundant, produces smaller, simpler, and more efficient representations of the optimal solution, and therefore is extremely valuable towards our goal of creating hierarchical programs.

3. Inductive logic programming and top-down induction of logical decision trees

Inductive logic programming (ILP) refers to a subdiscipline of machine learning that deals with learning logic programs. ILP attempts to use reasoning to derive useful and general principles, rules, and relationships from a set of specific data instances, hence the use of the term "inductive". This contrasts with the well-established field of deductive reasoning, in which a conclusion is inferred from a given set of premises through the application of accepted logical rules. In most cases, ILP has been used to learn concepts so that a set of unseen examples may be classified correctly.

Decision trees

One popular method for performing the classification of examples is the decision tree. A decision tree consists of a set of nodes, each of which is either an internal node (meaning it has a set of other nodes as its children), or a leaf node (in which case it has no children).

Each internal node contains a test that is used to split the possible set of examples into several mutually exclusive groups. The node then passes the current example down the branch corresponding to the result of the test. Often, the tests return Boolean results, so the tree is binary, with each node classifying the example as either passing or failing its particular test. Decision trees with this property are called binary decision trees.

Leaf nodes do not contain tests, but rather hold the classification label for examples that are sorted into that node. Thus an example can be deterministically classified by a decision tree.

Logical decision trees

According to [Blockeel and De Raedt, 1997], logical decision trees are binary decision trees which fulfill the following two properties:

    •  The tests in the internal nodes are first-order logical conjunctions of literals.
    •  A variable that is introduced in a particular node (i.e. one which does not exist higher in the tree) can only occur in the subtree corresponding to successful tests.

The second requirement is necessary due to the fact that if a test referring to a newly introduced variable fails, the agent cannot be sure that the variable refers to anything at all, and therefore it does not make sense to continue to refer to it.

We illustrate the concept of logical decision trees with a simple example, taken from [Džeroski et al., 2001]:
    clear(a) ?
    +--yes: clear(b) ?
    |       +--yes: clear(c) ?
    |       |       +--yes: [true]
    |       |       +--no:  [false]
    |       +--no:  [false]
    +--no:  [false]

This logical decision tree determines whether all the blocks are unstacked in the 3-blocks world.

The fact that we are using variables also necessitates the use of more complicated tests than those used in the propositional case. Internal nodes must contain a query including both the unique predicate introduced in the current node, and the predicates introduced in all the parent nodes of the current node. In other words, a node's test is the conjunction added locally at that node, while its query is the conjunction of tests accumulated along the path from the root. The following process for associating clauses and queries to nodes is adapted from [Blockeel and De Raedt, 1997]. We assume that examples sorted down the tree are passed to the left subtree if the node's test returns true, and to the right subtree if it returns false (in the figures, the yes-branch is always written first). We also assume that N.test returns the unique predicate assigned to the node N:

    •  With the root node T, the clause c0 ← T.test and the empty query are associated.
    •  For every internal node N with associated clause ci ← P and associated query Q, the query P is associated with its left subtree. The query Q, not(ci) is associated with the right subtree.
    •  With each internal node N with associated query Q, the clause ci ← Q, N.test is associated; i is a number unique for this node.
    •  Each leaf node N has no associated test, but only the query Q that it inherited from its parent.

It is important to note that we associate the query Q, not(ci) with the right subtree, and not the expected query Q, not(N.test), which would be symmetric with the left subtree. This is necessary because the queries of the left and right subtrees must be complementary; exactly one of the two queries associated with the two children of each node must succeed on any example passing through that node. Because it is possible that N.test shares variables with Q, the two queries Q, N.test and Q, not(N.test) are not necessarily complementary for all possible substitutions. This distinction does not exist in the propositional case. For a more detailed description of the logical inequivalence of the two queries, see ([Blockeel and De Raedt, 1997], p. 4-6).
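To make the construction concrete, applying these rules to the unstacking tree shown above (a worked example; the node names c0, c1, c2 are our own labels) gives the following associations:

    root, test clear(a):             query: (empty)                       clause: c0 ← clear(a)
    node clear(b):                   query: clear(a)                      clause: c1 ← clear(a), clear(b)
    node clear(c):                   query: clear(a), clear(b)            clause: c2 ← clear(a), clear(b), clear(c)
    leaf [false] right of root:      query: not(c0)
    leaf [false] right of clear(b):  query: clear(a), not(c1)
    leaf [false] right of clear(c):  query: clear(a), clear(b), not(c2)
    leaf [true]:                     query: clear(a), clear(b), clear(c)

Because every argument in this particular tree is a constant, not(ci) happens to coincide with the parent query conjoined with not(N.test); the distinction only matters once a test introduces new variables.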
Thus logical decision trees allow the classification of examples described in first-order (rather than propositional or attribute-value) logic, and therefore offer much greater expressiveness and flexibility than traditional decision trees. This characteristic makes them ideal candidates for use in relational reinforcement learning, which seeks to allow variablization and generalization over similar inputs.

Top-down induction of logical decision trees

In order to use logical decision trees for classification, however, we must have an algorithm for inducing them from a set of example instances. [Blockeel and De Raedt, 1997] offer the following algorithm, based on the classic TDIDT (top-down induction of decision trees) algorithm, but modified to induce logical decision trees. It assumes as input a set of labeled examples E, background knowledge B, and a language L specifying what kind of tests may be used in the decision tree. Both B and L are assumed to be available implicitly.

    buildtree (T, E, Q):
        if E is homogeneous enough
        then
            K := most_frequent_class (E)
            T := leaf (K)
        else
            Qb := best element of ρ(Q), according to some heuristic
            T.test := C' where Qb ← Q, C'
            E1 := { e ∈ E | e ∪ B |= Qb }
            E2 := { e ∈ E | e ∪ B |≠ Qb }
            buildtree (T.left, E1, Qb)
            buildtree (T.right, E2, Q)
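The recursion can also be summarized in a short sketch (illustrative Python-style code, not the actual TILDE implementation; queries are represented as frozensets of literals, and refine, satisfies, and score stand in for the language bias, the coverage test against the background knowledge, and the splitting heuristic):

    from collections import Counter

    class Node:
        def __init__(self, test=None, left=None, right=None, label=None):
            self.test, self.left, self.right, self.label = test, left, right, label

    def buildtree(examples, query, refine, satisfies, score, min_purity=1.0):
        labels = [label for _, label in examples]
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_purity:              # "homogeneous enough"
            return Node(label=top_label)
        q_best = max(refine(query), key=lambda q: score(q, examples))
        left_ex = [e for e in examples if satisfies(e, q_best)]
        right_ex = [e for e in examples if not satisfies(e, q_best)]
        return Node(test=q_best - query,                       # literals added at this node
                    left=buildtree(left_ex, q_best, refine, satisfies, score, min_purity),
                    right=buildtree(right_ex, query, refine, satisfies, score, min_purity))

A real implementation would also stop splitting when no refinement separates the examples.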
The refinement operator ρ determines the language bias (and thus serves the role of the language L). [Blockeel et al., 2001] use a refinement operator that adds one or more literals to the clause under θ-subsumption. A clause c1 subsumes another clause c2 if and only if there is a substitution θ such that c1θ ⊆ c2. In other words, a refinement operator under θ-subsumption maps a clause onto a set of clauses, each of which is subsumed by the initial clause. It is from this set of new clauses that the conjunction to be put in a new node is chosen. For example, the query clear(A) might be refined to clear(A), on(A,B), which adds a single literal and introduces the new variable B.

The TILDE system, which [Blockeel and De Raedt, 1997], [Blockeel, 1998], and [Blockeel et al., 2001] introduce and describe, implements the algorithm above. It can be used to induce both classification and regression logical decision trees, the latter using the equivalent regression tree system TILDE-RT. (Regression trees are identical in definition to classification trees, except that they contain real-valued numbers in their leaves rather than classifier labels.) We will see how logical decision trees, induced with the TILDE system, can be used to store Q- and P-values, while introducing variables and allowing for generalization in the reinforcement learning process.

4. Relational reinforcement learning

As described in Section 2, traditional reinforcement learning stores the learned Q-values in an explicit state-action table, with one value for each possible combination of state and action. The relational reinforcement learning method introduced by [Džeroski et al., 1998] and [Džeroski et al., 2001] modifies this aspect of reinforcement learning in order to introduce variables and to increase the ability of the agent to generalize learned knowledge to unseen parts of the state space. Rather than using an explicit state-action table, relational reinforcement learning stores the Q-values in a logical regression tree. There are several important advantages to this approach:

    •  It allows for larger or even infinite state spaces, or state spaces of unknown size.
    •  By using a relational representation, it takes advantage of the structural aspects of the task.
    •  It provides a way to apply the learned solution of one problem to other similar or more complicated problems.

Thus relational reinforcement learning offers the potential to overcome many of the serious drawbacks that traditional reinforcement learning has presented in the past.

Modifications to traditional Q-learning

Relational reinforcement learning as a technique is a more specific version of its traditional Q-learning roots. As in reinforcement learning, the algorithm has available to it:

    •  a set of possible states, S,
    •  a set of agent actions, A,
    •  a transition function which, although it may not be explicitly known by the algorithm, will return the next state given the current state and an action to perform, and
    •  a reward function that returns scalar reinforcement signals given a state.

In relational reinforcement learning, however, more specific rules govern the learning process. States are represented as sets of first-order logical facts, and the algorithm can only see one state at a time. Actions are also represented relationally, as predicates whose arguments may be variables or constants; following the Prolog convention used throughout this paper, uppercase letters denote variables and lowercase letters denote constants, so that move(A,B) is an action schema while move(a,table) is a particular action. Because not all actions are possible in all states, the algorithm can only consider actions that are possible given the current state, and it must be able to determine from the state description whether a given action is possible. Also, because of the relational representation of states and actions and the inductive logic programming component of the algorithm, there must exist some body of background knowledge which is generally true for the entire domain, as well as a language bias determining how the learned policy will be represented.
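For instance, a blocks world state can be represented simply as a set of ground facts, from which the legal primitive actions can be enumerated (a hypothetical sketch; it mirrors the facts shown later in Figure 2 and is not the encoding used by our actual Java implementation):

    # A blocks-world state as a set of ground facts.
    state = {"clear(c)", "on(c,b)", "on(b,a)", "on(a,table)"}

    def blocks(state):
        # Every block appears as the first argument of some on/2 fact.
        return {fact[3:-1].split(",")[0] for fact in state if fact.startswith("on(")}

    def legal_moves(state):
        # move(X,Y) is possible when X is a clear block and Y is the table
        # or a different clear block.
        clear = {b for b in blocks(state) if "clear(%s)" % b in state}
        return [("move", x, y) for x in clear for y in clear | {"table"} if x != y]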
As in traditional reinforcement learning, the agent then seeks to find an optimal policy π*, mapping states to actions, that maximizes some long-run measure of reinforcement.

Perhaps the most important modification, however, is the method of storage of the learned Q-values. In relational reinforcement learning, sets of state-action pairs and their corresponding Q-values are used to induce a logical regression tree after each training session. This tree then provides the Q-values for state-action pairs for the following training session. It also represents the policy which is being learned. In the version of relational reinforcement learning corresponding to P-learning, the tree is induced with sets of state-action pairs and their corresponding label (optimal or nonoptimal), and the resulting tree is a logical classification tree mapping state-action pairs to labels.
The Q relational reinforcement learning algorithm

The following algorithm, developed by [Džeroski et al., 1998] and [Džeroski et al., 2001], is essentially identical to a traditional Q-learning algorithm, except in its method of updating the Q-function, Qe (represented as a regression tree after an episode e).

    Q-RRL
        Initialize Q0 to assign 0 to all (s, a) pairs.
        Initialize Examples to the empty set.
        e := 0
        do forever
            e := e + 1
            i := 0
            generate random state s0
            while not goal(si) do
                select an action ai stochastically,
                    using the Q-exploration strategy
                    and the current hypothesis for Qe
                perform action ai
                receive immediate reward ri = r(si, ai)
                observe the new state si+1
                i := i + 1
            endwhile
            for j = i – 1 to 0 do
                generate example x = (sj, aj, qj),
                    where qj := rj + γ maxa' Qe(sj+1, a')
                if an example (sj, aj, qold) exists in
                    Examples, replace it with x, else add
                    x to Examples
            update Qe using TILDE-RT to produce
                Qe+1 using Examples
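The backward pass that generates the training examples can be sketched as follows (a simplified Python rendering of the inner loop above; q_tree stands for the tree learned after the previous episode, used as a lookup function, and states are assumed to be hashable, e.g. frozensets of facts):

    # Generate (state, action) -> Q-value examples for one finished episode.
    # trajectory: list of (state, action, reward) triples for steps 0 .. i-1;
    # final_state: the goal state reached at the end of the episode.
    def generate_examples(trajectory, final_state, q_tree, legal_actions, examples, gamma=0.9):
        next_state = final_state
        for state, action, reward in reversed(trajectory):
            best_next = max((q_tree(next_state, a) for a in legal_actions(next_state)),
                            default=0.0)
            # Replaces any older example for this state-action pair, as in Q-RRL.
            examples[(state, action)] = reward + gamma * best_next
            next_state = state
        return examples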
The algorithm uses the TILDE-RT system (described in Section 3) to induce the Q-tree. Because TILDE-RT is not incremental, a set of all examples encountered over all training sessions is maintained, with the most recent Q-value for each one, and used to reinduce the entire tree after each episode.

An example for a state-action-Q-value set is represented as a set of relational facts, including all the facts of the state itself, the relational form of the action, and the Q-value. State-action-Q-value sets are translated into examples in this format in order to induce the tree, and then a state-action pair can be translated into example format in order to retrieve the corresponding Q-value from the tree during a training session. Figure 1 demonstrates a sequence of state-action-Q-value sets encountered during a training session in the blocks world domain, and Figure 2 shows how these sets are then translated into example format for input into the TILDE-RT system. Figure 3 illustrates the Q-tree induced with the given sequence of examples as input. (These examples, as well as Figures 4 and 5, are taken from [Džeroski et al., 1998].)

    s0: c on b, b on a, a on table          action: move(c,table)   r = 0   Q = 0.81
    s1: b on a, a on table, c on table      action: move(b,c)       r = 0   Q = 0.9
    s2: b on c, c on table, a on table      action: move(a,b)       r = 1   Q = 1
    s3: a on b, b on c, c on table (goal)   action: move(a,table)   r = 0   Q = 0

Figure 1: A schematic representation illustrating one possible training session while using relational reinforcement learning in the blocks world to accomplish the ordered stacking goal.

    qvalue(0.81).              qvalue(1.0).
    action(move(c,table)).     action(move(a,b)).
    goal(on(a,b)).             goal(on(a,b)).
    goal(on(b,c)).             goal(on(b,c)).
    goal(on(c,table)).         goal(on(c,table)).
    clear(c).                  clear(a).
    on(c,b).                   clear(b).
    on(b,a).                   on(b,c).
    on(a,table).               on(a,table).
                               on(c,table).
    qvalue(0.9).
    action(move(b,c)).         qvalue(0.0).
    goal(on(a,b)).             action(move(a,table)).
    goal(on(b,c)).             goal(on(a,b)).
    goal(on(c,table)).         goal(on(b,c)).
    clear(b).                  goal(on(c,table)).
    clear(c).                  clear(a).
    on(b,a).                   on(a,b).
    on(a,table).               on(b,c).
    on(c,table).               on(c,table).

Figure 2: The state-action-Q-value sets, from the sequence in Figure 1, given as examples in relational format. This is exactly the input that would be given to TILDE-RT.

    goal_on(A,B),goal_on(B,C),goal_on(C,table),numberofblocks(D)
    on(A,B) ?
    +--yes: [0.0]
    +--no:  clear(A) ?
            +--yes: [1.0]
            +--no:  clear(B) ?
                    +--yes: [0.9]
                    +--no:  [0.81]

Figure 3: The Q-tree induced from the examples in Figure 2. A predicate such as goal_on(A,B) denotes the goal fact goal(on(A,B)), written as a single flattened predicate in TILDE's input syntax.
If the state is a goal state, the Q-value is automatically 0 regardless of the action; goal states are absorbing, so all actions that would cause the agent to leave the goal state must be considered nonoptimal (and thus have an absolute quality of 0).

From Figure 3, it is apparent that the logical regression tree used in relational reinforcement learning differs slightly from the definition of logical decision trees given in Section 3, in that it contains a root node (the first line of Figure 3) with facts that are guaranteed to be true for all possible examples. This is necessary in order to bind the variables of the tree to the appropriate constants, and this binding is then propagated to all nodes of the tree. Because we can bind the variables in the root node to any set of appropriate constants, the learned tree can be used as a solution to any problem of the same form, regardless of the actual names of the blocks. This is one of the most significant improvements of relational reinforcement learning over traditional Q-learning.

The P relational reinforcement learning algorithm

Just as traditional P-learning is almost identical to Q-learning, the P-learning counterpart to relational reinforcement learning introduces very few distinctions. Because our definition of the P-function depends upon the Q-function, we must learn both a Q-tree and a P-tree: after each episode the newly induced Q-tree is used to label state-action pairs as optimal or nonoptimal, and these labeled examples are then given to TILDE to induce the P-tree. However, the P-tree will in most cases prove to be simpler, and thus a more efficient method for choosing actions. The Q-tree is only learned in order to define the P-tree; it is not used to select actions.
The algorithm is identical except for three differences: the initial P-tree assigns a label of "optimal" to all state-action pairs; actions are chosen using the P-exploration strategy given in Section 2, rather than the Q-exploration strategy; and the following loop is added after the Q-function update loop, to update the P-function. Again, this pseudocode was taken from [Džeroski et al., 2001]:

    for j = i – 1 to 0 do
        for all actions ak possible in state sj do
            if state-action pair (sj, ak) is optimal
                according to Qe+1
            then generate example (sj, ak, c)
                where c = 1
            else generate example (sj, ak, c)
                where c = 0
    update Pe using TILDE to produce Pe+1 using
        these examples (sj, ak, c)
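A compact sketch of this labeling pass (Python; q_tree is the freshly induced Q-tree Qe+1 used as a lookup function, and legal_actions enumerates the actions possible in a state):

    # Label every possible action in every visited state as optimal (1) or nonoptimal (0);
    # the labeled pairs become the input from which TILDE induces the P-tree.
    def generate_p_examples(visited_states, q_tree, legal_actions, is_goal):
        p_examples = []
        for state in visited_states:
            actions = legal_actions(state)
            if is_goal(state):
                # Goal states are absorbing: every action leaving them is nonoptimal.
                p_examples.extend((state, a, 0) for a in actions)
                continue
            best = max(q_tree(state, a) for a in actions)
            p_examples.extend((state, a, 1 if q_tree(state, a) == best else 0)
                              for a in actions)
        return p_examples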
Again, all actions from a goal state are assigned a label of nonoptimal, for the reason discussed above.

The same sequence of state-action-Q-value sets from Figure 1 is given in Figure 4, but in the format that would be given as input to the TILDE system in order to induce the logical classification P-tree. The only difference is that the classifier label "optimal" or "nonoptimal" is given, rather than a Q-value. Figure 5 illustrates the corresponding P-tree.

    optimal.                   optimal.
    action(move(c,table)).     action(move(a,b)).
    goal(on(a,b)).             goal(on(a,b)).
    goal(on(b,c)).             goal(on(b,c)).
    goal(on(c,table)).         goal(on(c,table)).
    clear(c).                  clear(a).
    on(c,b).                   clear(b).
    on(b,a).                   on(b,c).
    on(a,table).               on(a,table).
                               on(c,table).
    optimal.
    action(move(b,c)).         nonoptimal.
    goal(on(a,b)).             action(move(a,table)).
    goal(on(b,c)).             goal(on(a,b)).
    goal(on(c,table)).         goal(on(b,c)).
    clear(b).                  goal(on(c,table)).
    clear(c).                  clear(a).
    on(b,a).                   on(a,b).
    on(a,table).               on(b,c).
    on(c,table).               on(c,table).

Figure 4: The state-action-P-value sets, from the sequence in Figure 1, given as examples in relational format. It is identical to the input in Figure 2, but with classifier labels rather than Q-values. This is exactly the input that would be given to TILDE.

    goal_on(A,B),goal_on(B,C),goal_on(C,table),numberofblocks(D)
    on(A,B) ?
    +--yes: [nonoptimal]
    +--no:  [optimal]

Figure 5: The P-tree induced from the examples in Figure 4.

5. Hierarchical relational reinforcement learning

[Džeroski et al., 2001] thoroughly describe the method of relational reinforcement learning presented in Section 4. They also offer a set of thorough experimental results testing the effectiveness of the method at planning in the blocks world domain. One of their most significant findings is that relational reinforcement learning works extremely well on very simple goals, but becomes less effective at solving even slightly more complex problems. For the goals of unstacking all the blocks onto the table, or building all the blocks into a simple tower (in any order), the method gives very impressive convergence rates to an optimal solution. However, it converges more slowly with the goal of placing one specific block on top of another specific block, a goal which, unlike the first two, names specific blocks.

Our implementation of relational reinforcement learning seems to support this conclusion. We attempted to learn a more complicated goal: building all the blocks in the world (three blocks, in the experiments reported here) into a single ordered tower. We found great difficulty producing optimal solutions, or even improving the accuracy rate while learning. Section 6 of this paper gives the results of these preliminary experiments and offers some discussion.

This relative lack of success in solving more complex problems suggests that the benefits of hierarchical learning in a relational reinforcement learning context could be quite profound. By learning first a set of useful routines for accomplishing subgoals, and then using these subroutines as the available action set to learn the original goal, the agent could significantly reduce the number of training sessions required to converge upon an optimal policy, and in the end could produce a simpler representation of the optimal policy. Figure 6 illustrates schematically how this idea could be approached in the blocks world domain. In order to test this hypothesis, we have developed a simple implementation strategy for hierarchical relational reinforcement learning.

                        Main Goal:
                    orderedStack(a,b,c)
                  /         |          \
        Subgoal:         Subgoal:        Subgoal:
        clear(A)         on(A,B)         unstack

Figure 6: The main goal in the ordered stacking task can be accomplished by achieving some combination of subgoals in a particular order. It is this structure of a learning problem that hierarchical learning hopes to exploit.

An implementation of hierarchical relational reinforcement learning

In this simple implementation, the choice of subgoals that will be learned is supplied to the agent as a form of background knowledge. In a more advanced implementation, the agent might learn which subgoals to use, as well as the solutions to the subgoals themselves.

Each subroutine policy is learned independently using relational reinforcement learning, which produces a (presumably optimal) Q- or P-tree. The agent that is learning the larger goal then creates instances of each subroutine by instantiating the variablized root nodes of each tree once for each possible binding. In our blocks world domain, for example, if the subroutine is makeClear(A) and there are three blocks in the world named a, b, and c, the agent would create three instances:

    makeClear(a)
    makeClear(b)
    makeClear(c).

For a subroutine makeOn(A,B) with three blocks, the agent would create nine instances:

    makeOn(a,b)
    makeOn(a,c)
    makeOn(a,table)
    makeOn(b,a)
    makeOn(b,c)
    makeOn(b,table)
    makeOn(c,a)
    makeOn(c,b)
    makeOn(c,table).

(The number of such instantiations grows quickly with the number of objects in the domain; we do not address this scaling issue here.) For a subroutine with no arguments, such as unstack, a single instantiation is created.
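The instantiation step itself is straightforward to sketch (hypothetical Python; each subroutine schema is described by a name and the allowed domain of every argument position):

    from itertools import product

    # Enumerate ground instances of a subroutine schema, one per binding of its arguments.
    def instantiate(name, domains):
        return [(name,) + binding
                for binding in product(*domains)
                if len(set(binding)) == len(binding)]    # arguments must be distinct

    blocks = ["a", "b", "c"]
    subroutines = (instantiate("makeClear", [blocks])                     # 3 instances
                   + instantiate("makeOn", [blocks, blocks + ["table"]])  # 9 instances
                   + instantiate("unstack", []))                          # 1 instance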
Note that the instantiation of subroutines is only possible because relational reinforcement learning produces variablized trees, so one solution can be used to solve many problems. Past attempts at hierarchical reinforcement learning have been slowed without this feature.

After instantiating all subroutines, the agent then uses the available subroutines as its set of possible actions while learning a policy for achieving the main goal. We define the preconditions of a subroutine to be simply that its subgoal is not already accomplished in the given state. Thus subroutines have the power to be much more effective than simple primitive actions, as they can make a particular subgoal true, in the fewest number of steps, starting in any state.
In the Q-learning process, subroutines, like primitives, are assigned Q-values that reflect the quality of taking that action in a particular state. (The P-function is defined as usual using the Q-function.) However, the original Q update function proves to be inaccurate when used with subroutines. Consider the example depicted in Figure 7.

      si ==[subroutine a]==> si+3 ----------> si+4   [0.81]
      (si ...> si+1 ...> si+2 ...> si+3 : primitive actions taken by subroutine a)

Figure 7: A possible sequence of states visited by the agent during a training session. The actions explicitly taken by the agent are designated by the bold arrows, while the primitive actions taken by the subroutine are shown with dotted arrows.
Usually the state-action pair (si, a) would be assigned a new Q-value as follows:

    newQ = ri + γ maxa' Qe(si+3, a'),

where maxa' Qe(si+3, a') in this case is 0.81 and ri is 0, so newQ would be 0.72 (with a discount factor γ of 0.9). But this is misleading, because it implies that si is much closer to the goal than it actually is. In later training sessions, the agent will be unfairly biased towards choosing a path through state si.

In order to preserve the accuracy of the Q-function, we remember the number of primitive moves taken by the subroutine from state si, say n, to its next state si+3, and we calculate the new Q-value with the following modified update equation:

    newQ = ri + γ^n maxa' Qe(si+3, a'),

where n is the number of primitive moves taken by the subroutine just completed. Because ordinary primitive moves trivially contain only one primitive (n = 1), the modified function works exactly the same as the original function for the case of primitive actions.
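A sketch of the modified update (Python; n_primitive_moves is the number of primitive moves the subroutine executed, so an ordinary primitive action is simply the case n = 1):

    # Modified Q-update for subroutine actions: discount by gamma**n rather than gamma.
    def q_update_subroutine(Q, state, subroutine, reward, next_state, next_actions,
                            n_primitive_moves, gamma=0.9):
        best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
        Q[(state, subroutine)] = reward + (gamma ** n_primitive_moves) * best_next
        return Q

With the values suggested by Figure 7 (reward 0, best successor value 0.81, γ = 0.9, and n = 3 primitive moves), this assigns 0.9^3 × 0.81 ≈ 0.59 instead of the misleading 0.72.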
Since we assume the subroutines have already been learned to be optimal, this is effectively a shortcut on the learning path. Because the subroutine is definitely the shortest path between states si and si+3, the only way another action out of state si would be chosen is if there exists a shorter path to the goal that does not pass through si+3.

Using this modified Q update function, we can run the relational reinforcement learning algorithm as usual, but using subroutines to improve the efficiency of the learning.

6. Experiments

In order to test the hypothesis that hierarchical relational reinforcement learning overcomes some of the ineffectiveness of relational reinforcement learning on more complicated goals, we conducted a set of experiments aimed at comparing the accuracy over time of the learned regular policies and the learned hierarchical policies.

Experimental setup

We conducted two sets of experiments. In each set we performed 10 total runs of 100 training sessions each in the 3-blocks world: 5 runs using very limited background knowledge, and 5 runs using a more extended body of background knowledge. In the first set of experiments we did not use hierarchical learning, and in the second set we did. Some runs take over 10 hours to complete, largely because of the overhead of interfacing our Java implementation with the externally provided TILDE and TILDE-RT code; due to these time constraints, experiments were only conducted in the 3-blocks world, and the results from experiments conducted in the 4- and 5-blocks worlds are forthcoming.

The current state was described using the following primitive predicates:

    •  clear(A): true if block A has no other blocks on top of it
    •  on(A,B): true if block A is directly on top of B. B may be a block or the table.

The goal task was to stack a set of blocks in an ordered tower. Specifically, the goal configuration for each number of blocks was:

    •  3 blocks: on(a,b), on(b,c), on(c,table).
    •  4 blocks: on(a,b), on(b,c), on(c,d), on(d,table).
    •  5 blocks: on(a,b), on(b,c), on(c,d), on(d,e), on(e,table).

For the non-hierarchical learning experiments, the available action set consisted of all possible forms of the primitive move(A,B), which may only be executed if both A and B are clear. The second argument can be the table.
For the hierarchical learning experiments, we first learned optimal policies for the following subroutines:

    •  unstack: goal reached if all blocks in the world are on the table
    •  makeOn(A,B): goal reached if block A is on block B

The subroutine makeClear(A), although it is useful in achieving the goal of an ordered stack, is itself a subroutine of makeOn(A,B), and thus was omitted; only unstack and makeOn(A,B) were used as subroutines for the main task. Figures 8 and 9 give the optimal learned P-trees used for each subroutine. The instantiations of each of these subroutines then became the possible action set for accomplishing the goal task above.

    goal_unstack,numberofblocks(A)
    action_move(B,table) ?
    +--yes: [optimal]
    +--no:  [nonoptimal]

Figure 8: An optimal P-tree for the subroutine unstack, learned after 10 training sessions.

    goal_on(A,B),numberofblocks(C)
    action_move(D,table) ?
    +--yes: clear(A) ?
    |       +--yes: clear(B) ?
    |       |       +--yes: [nonoptimal]
    |       |       +--no:  on(A,B) ?
    |       |               +--yes: on(B,table) ?
    |       |               |       +--yes: [nonoptimal]
    |       |               |       +--no:  [optimal]
    |       |               +--no:  [optimal]
    |       +--no:  [optimal]
    +--no:  action_move(A,B) ?
            +--yes: [optimal]
            +--no:  [nonoptimal]

Figure 9: An optimal P-tree for the subroutine makeOn(A,B), learned after 51 training sessions.

Our implementation uses the equations and algorithms described in Sections 2–5 of this paper. We used a discount factor γ of 0.9 for the Q update function and a starting temperature T of 5 for the Q and P exploration functions. The temperature was decreased according to a schedule that depends on a factor f and the current training session number e, where f = (total number of training sessions in the current run) / 10. We also used the TILDE and TILDE-RT systems, as part of the ACE Data Mining System [Blockeel et al., 2001], provided by Hendrik Blockeel at Katholieke Universiteit Leuven, Belgium.

Experiments conducted with a limited body of background knowledge provided to TILDE and TILDE-RT used only the following pieces of information:

    •  eq(X,Y): returns true if blocks X and Y are the same block
    •  block(X): a variable is considered a block (as opposed to the table) if and only if it is found as the first argument in some fact on(X,Y).

Experiments conducted with the more extended body of background knowledge used essentially the same information used by [Džeroski et al., 2001]. These included:

    •  above(X,Y): returns true if block X is above block Y in some stack.
    •  height(X,N): returns true if block X is at height N above the table.
    •  numberofblocks(N): returns true if there are N blocks in the world.
    •  numberofstacks(N): returns true if there are N stacks in the current state.
    •  diff(X,Y,Z): returns true if X – Y = Z.

Note that this information only affects the learning process in that it is made available to TILDE during the tree induction process. It does not affect the way states are represented or how actions are carried out.
[Should goal_on(A) in the above be goal_on(A,C) or The results of the 3-block runs with limited background
something?      And shouldn‟t there be more to this                knowledge, and extended background knowledge are
figure?]                                                           given in Figures 10 and 11, respectively. The accuracy
                                                                   of the learned P function for hierarchical and non-
                                                                   hierarchical learning is plotted over time. The accuracy
Figure 9: An optimal P-tree for the subroutine makeOn(A,B),
                                                                   is defined as the percentage of state-action pairs that are
learned after 51 training sessions.
                                                                   correctly classified as optimal or nonoptimal using the
                                                                   learned P function. The accuracy of all 5 runs of
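The trees in Figures 8 and 9 are first-order decision trees induced by TILDE. Purely to illustrate how such a learned P-tree is used at execution time, the hypothetical sketch below hand-codes the tiny unstack tree of Figure 8 as a predicate and uses it to pick an action from a candidate set; it is not the system's actual tree-evaluation code.

    def p_unstack(state, action):
        """Hand-coded version of Figure 8: a move whose destination is the table
        is labelled optimal for the unstack goal; everything else is nonoptimal."""
        name, _block, dest = action
        return 'optimal' if name == 'move' and dest == 'table' else 'nonoptimal'

    def greedy_action(state, candidates, p_tree):
        """Return a candidate the P-tree labels optimal, falling back to the first."""
        for action in candidates:
            if p_tree(state, action) == 'optimal':
                return action
        return candidates[0]

    # Example: in the state on(a,b), on(b,c), on(c,table), move(a,table) is selected.
    state = {('a', 'b'), ('b', 'c'), ('c', 'table')}
    print(greedy_action(state, [('move', 'a', 'b'), ('move', 'a', 'table')], p_unstack))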
Our implementation uses the equations and algorithms described in parts 2–5 of this paper. We used a discount factor γ of 0.9 for the Q update function and a starting temperature T of 5 for the Q and P exploration functions. We used the following equation as a schedule for temperature decrease, given a factor f and a current training session number e:

    T(f, e) = f / (e + f),

where f = (total number of training sessions in the current run) / 10. We also used the TILDE and TILDE-RT systems, part of the ACE Data Mining System [Blockeel et al., 2001], provided by Hendrik Blockeel at Katholieke Universiteit Leuven, Belgium.
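The sketch below shows one plausible reading of this schedule, assuming it scales the starting temperature T = 5 (the exact Q and P exploration functions are defined earlier in the paper), together with a generic Boltzmann-style selection rule that such a temperature would drive:

    import math
    import random

    def temperature(e, total_sessions, t0=5.0):
        """T(f, e) = f / (e + f) with f = total_sessions / 10, read here as a
        multiplier on the starting temperature t0 (an assumption on our part)."""
        f = total_sessions / 10.0
        return t0 * f / (e + f)

    def boltzmann_choice(values, temp):
        """Pick an index with probability proportional to exp(value / temp)."""
        m = max(values)
        weights = [math.exp((v - m) / max(temp, 1e-6)) for v in values]
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                return i
        return len(values) - 1

    # The temperature falls from 5 toward 0 as the training session number e grows.
    for e in (0, 10, 50, 100):
        print(e, round(temperature(e, total_sessions=100), 3))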
Experiments conducted with a limited body of background knowledge provided to TILDE and TILDE-RT used only the following pieces of information:

•  eq(X,Y): returns true if blocks X and Y are the same block
•  block(X): a variable is considered a block (as opposed to the table) if and only if it is found as the first argument in some fact on(X,Y)

Experiments conducted with the more extended body of background knowledge used essentially the same information used by [Džeroski et al., 2001]. These included:

•  above(X,Y): returns true if block X is above block Y in some stack
•  height(X,N): returns true if block X is at height N above the table
•  numberofblocks(N): returns true if there are N blocks in the world
•  numberofstacks(N): returns true if there are N stacks in the current state
•  diff(X,Y,Z): returns true if X – Y = Z
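These predicates are supplied to TILDE as Prolog background knowledge; the Python rendering below is only meant to pin down their meaning on a state represented as a set of on(X,Y) facts (the base case chosen for height is our own, since the paper does not spell it out):

    TABLE = 'table'

    def on_what(x, state):
        """The object that block x rests on, or None if x is not a block here."""
        return next((y for (a, y) in state if a == x), None)

    def above(x, y, state):
        """above(X,Y): block x is somewhere above block y in the same stack."""
        below = on_what(x, state)
        while below is not None and below != TABLE:
            if below == y:
                return True
            below = on_what(below, state)
        return False

    def height(x, state):
        """height(X,N): here N counts the blocks beneath x in its stack."""
        n, below = 0, on_what(x, state)
        while below is not None and below != TABLE:
            n += 1
            below = on_what(below, state)
        return n

    def numberofblocks(state):
        return len({a for (a, _) in state})

    def numberofstacks(state):
        """One stack for every block resting directly on the table."""
        return sum(1 for (_, y) in state if y == TABLE)

    def diff(x, y):
        """diff(X,Y,Z) holds when X - Y = Z."""
        return x - y

    # Example: with on(a,b), on(b,c), on(c,table): above('a','c',...) is True
    # and height('a',...) is 2.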
Note that this information only affects the learning process in that it is made available to TILDE during the tree induction process. It does not affect the way states are represented or how actions are carried out.

Experimental results

The results of the 3-block runs with limited and extended background knowledge are given in Figures 10 and 11, respectively. The accuracy of the learned P function for hierarchical and non-hierarchical learning is plotted over time. Accuracy is defined as the percentage of state-action pairs that are correctly classified as optimal or nonoptimal by the learned P function. The accuracy of all 5 runs of hierarchical learning is averaged and plotted after each training session. The equivalent curve for non-hierarchical learning is given on the same graph in order to illustrate the contrast.
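As a minimal sketch of this measure, accuracy can be computed by running the learned P function over a labelled set of state-action pairs, with the true labels obtained, for instance, by exhaustive search in these small worlds:

    def p_accuracy(p_tree, labelled_pairs):
        """labelled_pairs: iterable of (state, action, true_label) triples, where
        true_label is 'optimal' or 'nonoptimal'; returns the percentage correct."""
        pairs = list(labelled_pairs)
        correct = sum(1 for s, a, label in pairs if p_tree(s, a) == label)
        return 100.0 * correct / len(pairs)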
Figure 10: Hierarchical and non-hierarchical learning in the 3-blocks world, with limited background knowledge.
Figure 11: Hierarchical and non-hierarchical learning in the 3-blocks world, with extended background knowledge.

[Whoops---no figures]

[NOTE: Again, due to time constraints, only one execution of the hierarchical learning with extended background knowledge was completed. The accuracy of this one run is plotted against the average accuracy of the non-hierarchical case. More statistically sound results are forthcoming.]
The first observation we can make from the results in both Figures 10 and 11 is that relational reinforcement learning, without hierarchical learning, shows very little increase in accuracy over time. Whether or not extended background knowledge is used, the average accuracy of the learning program seems to hover around 80%. This adds further evidence to the conclusion drawn by [Džeroski et al., 2001] that relational reinforcement learning becomes much less effective when applied to slightly harder problems.

Second, from Figure 10 we see that hierarchical learning with limited background knowledge, contrary to our hypothesis, in fact makes very little improvement over the non-hierarchical version. The individual runs in both cases do at times hit 100% accuracy, but the averaged results displayed on the graph demonstrate that this does not occur consistently. However, we can give examples of optimal P-trees for both hierarchical and non-hierarchical learning produced on these occasions; these are shown in Figures 12 and 13.

goal_on(A,table),goal_on(B,A),goal_on(C,B)
action_move(D,table) ?
+--yes: eq(D,B) ?
|       +--yes: clear(C) ?
|       |       +--yes: [nonoptimal]
|       |       +--no: clear(A) ?
|       |               +--yes: [nonoptimal]
|       |               +--no: [optimal]
|       +--no: [optimal]
+--no: action_move(B,A) ?
        +--yes: on(A,C) ?
        |       +--yes: [nonoptimal]
        |       +--no: [optimal]
        +--no: on(B,A) ?
                +--yes: action_move(C,B) ?
                |       +--yes: [optimal]
                |       +--no: [nonoptimal]
                +--no: [nonoptimal]

Figure 12: An optimal P-tree for ordered stacking, learned after 24 training sessions with non-hierarchical learning and limited background knowledge.

goal_on(A,table),goal_on(B,A),goal_on(C,B)
action_makeOn(A,D) ?
+--yes: [nonoptimal]
+--no: clear(B) ?
       +--yes: on(A,C) ?
       |      +--yes: [nonoptimal]
       |      +--no: action_makeOn(B,A) ?
       |             +--yes: clear(A) ?
       |             |       +--yes: [optimal]
       |             |       +--no: [nonoptimal]
       |             +--no: action_makeOn(C,table) ?
       |                    +--yes: clear(C) ?
       |                    |       +--yes: [optimal]
       |                    |       +--no: [nonoptimal]
       |                    +--no: on(B,A) ?
       |                           +--yes: action_makeOn(C,B) ?
       |                           |       +--yes: [optimal]
       |                           |       +--no: [nonoptimal]
       |                           +--no: [nonoptimal]
       +--no: action_makeOn(B,A) ?
              +--yes: on(C,table) ?
              |       +--yes: [nonoptimal]
              |       +--no: [optimal]
              +--no: on(C,table) ?
                     +--yes: action_makeOn(C,A) ?
                     |       +--yes: [nonoptimal]
                     |       +--no: [optimal]
                     +--no: on(A,table) ?
                            +--yes: action_makeOn(C,R) ?
                            |       +--yes: on(B,R) ?
                            |       |       +--yes: [optimal]
                            |       |       +--no: [nonoptimal]
                            |       +--no: [nonoptimal]
                            +--no: [nonoptimal]

Figure 13: An optimal P-tree for ordered stacking, learned after 45 training sessions with hierarchical learning and limited background knowledge.

Perhaps, however, this lack of improvement is due to the limited background knowledge available. We turn to Figure 11, the results of the preliminary experiments conducted with extended background knowledge, to determine whether this is true.

While the results of the hierarchical run illustrated in Figure 11 are by no means statistically sound, they suggest that the results of more thorough tests would be very similar to the hierarchical learning performed with limited background knowledge. In both cases, hierarchical learning does not seem to show significant improvement over non-hierarchical learning. [It's possible (probable?) that the lack of improvement is due to the fact that the problem is too small for hierarchical learning to be needed.]

7. Conclusions and further work

Naturally the first and most important avenue for further work will be to complete the experimental setup described here, first for the remainder of the 3-block hierarchical runs with extended background knowledge, and then to repeat the entire set of experiments in the 4- and 5-blocks worlds.


It is possible that as the state space and the complexity of the problem increase, the contrast between hierarchical and non-hierarchical accuracy will become more pronounced. In addition, it is possible to observe a very slight but perceptible increase in the average accuracy of all the learning methods over the course of the 100 training sessions. With enough time and computing power, it would be interesting to see how the graph would look after 500 or 1000 training sessions.

However, it may be the case that our simple strategy for hierarchical learning is not sophisticated enough to significantly improve the overall performance of the relational reinforcement learning algorithm. This possibility warrants further research into more sophisticated hierarchical methods: using recursion, creating nested hierarchies, taking greater advantage of the structure of the problem, or working more closely with the available background knowledge.

Finally, further work on ways of improving relational reinforcement learning in general would be fascinating. For example, as stated above, the program at times stumbles upon a completely optimal Q- or P-tree and then, due to a high temperature value, wanders away from the optimal solution. [I'll bet that the annealing schedule needs to be accelerated. That is, the temperature ought to be decreased more rapidly.] It would be interesting to investigate ways of introducing a type of evaluation function, similar to that used in genetic programming, which would greatly decrease the temperature or even halt the program when an optimal or near-optimal solution is found, even if it is early in the run.
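As a rough sketch of these two ideas (neither of which was part of the experiments reported here), one could cool the temperature more steeply and add an evaluation hook that halts training once the learned policy classifies nearly all evaluation pairs correctly; the factor k and the threshold below are made-up knobs:

    def accelerated_temperature(e, total_sessions, t0=5.0, k=3.0):
        """Same shape as T(f, e) = f / (e + f), but with f shrunk by a factor k
        so the temperature decays k times faster."""
        f = (total_sessions / 10.0) / k
        return t0 * f / (e + f)

    def should_stop(p_tree_accuracy, threshold=99.0):
        """Evaluation hook: halt (or freeze the temperature near zero) once the
        learned P function classifies almost all evaluation pairs correctly."""
        return p_tree_accuracy >= threshold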
                                                                   on Machine Learning, 136-143. Morgan Kaufmann,
Relational reinforcement learning offers great potential: it combines two extremely powerful machine learning techniques, and uses each one to overcome the other's drawbacks. Similarly, a hierarchical approach to the learning problem appears as though it could exploit the benefits and avoid the weaknesses of relational reinforcement learning. Despite the rather unpromising results of the simple and small-scale experiments conducted here, I believe that relational reinforcement learning warrants a great deal more attention and thought, if only to discover more concretely its specific strengths and weaknesses.

Acknowledgements

This work has been supported by the Computer Science Undergraduate Research Internship program at Stanford University. A special thanks to Professor Nils Nilsson for his boundless support, encouragement, and enthusiasm, and to my research colleagues Mykel Kochenderfer and Praveen Srinivasan for their valuable commentary on my work. Also, grateful acknowledgements are due to Hendrik Blockeel and Kurt Driessens at Katholieke Universiteit Leuven for their patience in answering my questions regarding the TILDE system, and for making the system available to me for this project.

References

[Blockeel, 1998] Blockeel, H. Top-down Induction of First Order Logical Decision Trees. Department of Computer Science, Katholieke Universiteit Leuven, Belgium, December 1998.

[Blockeel et al., 2001] Blockeel, H., Dehaspe, L., and Ramon, J. "The ACE Data Mining System: User's Manual." Department of Computer Science, Katholieke Universiteit Leuven, Belgium, October 2001.

[Blockeel and De Raedt, 1997] Blockeel, H. and De Raedt, L. "Top-down Induction of Logical Decision Trees." Department of Computer Science, Katholieke Universiteit Leuven, Belgium, January 1997.

[De Raedt and Blockeel, 1997] De Raedt, L. and Blockeel, H. "Using Logical Decision Trees for Clustering." Proceedings of the Seventh International Workshop on Inductive Logic Programming, 133-141, 1997.

[Džeroski et al., 1998] Džeroski, S., De Raedt, L., and Blockeel, H. "Relational Reinforcement Learning." Proceedings of the Fifteenth International Conference on Machine Learning, 136-143. Morgan Kaufmann, 1998.

[Džeroski et al., 2001] Džeroski, S., De Raedt, L., and Driessens, K. "Relational Reinforcement Learning." Machine Learning, 43:7-52. Kluwer Academic Publishers, The Netherlands, 2001.

[Finney et al., 2002] Finney, S., Gardiol, N., Kaelbling, L. P., and Oates, T. "Learning with Deictic Representation." Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA, April 2002.

[Kaelbling et al., 1996] Kaelbling, L. P., Littman, M., and Moore, A. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research, 4:237-285, 1996.

[Nilsson, 1996] Nilsson, N. J. Introduction to Machine Learning. Department of Computer Science, Stanford University, CA, September 1996. Unpublished.

