VIEWS: 8 PAGES: 12 POSTED ON: 7/21/2011 Public Domain
Hierarchical Relational Reinforcement Learning Meg Aycinena Department of Computer Science Stanford University Stanford, California, 94305 aycinena@cs.stanford.edu Abstract goals such as clearing one block, or placing one block on top of another. However, when faced with a more Experiments with relational reinforcement complex goal such as building an ordered tower of learning, a combination of reinforcement blocks, relational reinforcement learning works less learning and inductive logic programming, are effectively than its traditional reinforcement learning presented. This technique offers greater precursor.[Meg: isn‟t this comparing apples with expressive power than that offered by traditional oranges? The “traditional reinforcement learning reinforcement learning. We use it to find general precursor” led to huge state-action tables that could only solutions in the blocks world domain. We solve problems with specific, named blocks. RRL could discuss some difficulties associated with solve problems with arbitrarily named blocks.] relational reinforcement learning, specifically its decreased effectiveness when presented with The introduction of a hierarchical learning strategy may more complex problems. Finally, we explore harness the effectiveness of relational reinforcement ways in which hierarchical methods and learning learning in solving simpler problems to the solution of subroutines may be able to overcome some of more complicated goals. Our hypothesis states that by these weaknesses. first learning a set of useful subroutines and then learning to combine these subroutines to achieve a 1. Introduction stated goal, an agent may potentially converge more consistently and more quickly upon an optimal solution. Reinforcement learning can be a powerful machine learning technique, but it suffers from an inability to Some work has been done on hierarchical learning in generalize well over large state spaces, or to reuse the traditional reinforcement learning [Kaelbling et al., results of one learning session on a similar, but slightly 1996], but has been hindered due to the lack of different or more complicated problem. Džeroski et al. variablized solutions to subproblems and the inability to [Džeroski et al., 1998], [Džeroski et al., 2001], have generalize those solutions to slightly different problems. developed a way to combine reinforcement learning Because relational reinforcement learning can produce with inductive logic programming, to produce a more variablized solutions to subproblems and because it has flexible, expressive approach that circumvents many of been shown to perform well on simpler problems, the problems that have plagued traditional hierarchical learning is easier to implement and possibly reinforcement learning. Specifically, the method more powerful than its traditional counterpart. modifies the traditional Q-learning technique so that it Hierarchical relational reinforcement learning promises stores the Q-values of state-action pairs in a top-down to exploit the advantages and overcome some of the induced logical decision tree. The method allows the weaknesses of relational reinforcement learning in order use of variables, eliminates the need for an explicit, to improve the actual process of learning, as well as the memory-expensive state-action table, and generalizes applicability of the learned solution. In this paper, we already learned knowledge to unseen parts of the state investigate whether this actually occurs. space. The paper is organized as follows. In Section 2 we Although relational reinforcement learning greatly describe traditional reinforcement learning, Q-learning, improves upon traditional reinforcement learning, it and how P-learning extends it. In Section 3 we briefly tends to work less effectively with more complicated summarize inductive logic programming, and describe goals. Its problem solving ability in the classic blocks top-down induction of logical decision trees, as world domain, for example, is impressive with simple implemented in the TILDE system, [Blockeel, 1998]. In 1 Section 4 we describe relational reinforcement learning, results they produce. In Q-learning, each possible state- and how we have implemented it according to the action pair is assigned a quality value, or “Q-value” and design of [Džeroski et al., 2001]. Section 5 describes a actions are chosen by selecting the action in a particular simple implementation strategy for hierarchical state which has the highest Q-value. relational reinforcement learning, and Section 6 briefly offers the results of the preliminary experiments we The process works as follows. The agent can be in one have conducted in order to explore its effectiveness. state of the environment at a time. In a given state si Section 7 concludes and discusses possible further S, the agent selects an action ai A to execute work. according to its policy π. Executing this action puts the agent in a new state si+1, and it receives the reward ri+1 2. Traditional reinforcement learning, Q- associated with the new state. The policy in Q-learning learning, and P-learning is given by: The term “reinforcement learning” and some of the π(s) = arg maxa Q(s, a),and thus by learning an general characteristics of the technique have been optimal function Q*(s, a) such that we maximize the borrowed from cognitive psychology, but in recent total reward r, we can learn the optimal policy π*(s). years it has become increasingly popular within the We update Q(s, a) after taking each action so that it computer science disciplines of artificial intelligence better approximates the optimal function Q*(s, a). We and machine learning. See [Kaelbling et al., 1996] for a do this with the following equation: broad survey of the field as it has been applied within computer science. In its simplest form, reinforcement Q (si, ai) := ri + maxa’ Q (si+1, a’), learning describes any machine learning technique in which a system learns a policy to achieve a goal through [Meg: you should explain somewhere what a’ refers to. a series of trial-and-error training sessions, receiving a And you probably want a' instead of a’. That is „prime‟ reward or punishment after each session, and learning instead of a single closed-quote mark.] from this “reinforcement” in future sessions. (Note: we shall use the terms “training session” and “episode” interchangeably to refer to one cycle of state-action where is the discount factor, 0 < < 1. sequences culminating in a goal state.) In the first few training sessions, the agent has not yet Formally, [Kaelbling et al., 1996] describe the learned even a crude approximation of the optimal Q- reinforcement learning problem as follows: function, and therefore the initial actions will be random if the initial Q‟s are. However, due to this randomness, a discrete set of environment states, S, the danger exists that, although the Q-function may a discrete set of agent actions, A, and eventually converge to a solution to the problem, it may a set of scalar reinforcement signals, typically not find the optimal solution to the problem, because the {0, 1}, or the real numbers. initial path it stumbles upon may be a correct but sub- optimal path to the goal. Therefore it is necessary that The goal of the agent is to find a policy π, mapping every reinforcement learning agent have some policy of states to actions, that maximizes some long-run measure exploration of its environment, so that, if a better of reinforcement. We will denote the optimal policy as solution exists than the one it has already found, it may π*. Various techniques exist within reinforcement stumble upon it. learning for choosing an action to execute in a particular state. Q-learning is one of the more traditional of these One popular method (e.g. [Džeroski et al., 2001]) for techniques. arbitrating the tradeoff between exploitation (using knowledge already learned) and exploration (acting Q-learning randomly in the hopes of learning new knowledge) is to choose each action with a probability proportionate to Q-learning falls within the “model-free” branch of the its Q-value. Specifically: reinforcement learning family, because it does not require the agent to learn a model of the world or Pr (ai | s) = T-Q (s, ai) / j T-Q (s, aj) . environment (i.e. how actions lead from state to state, and how states give rewards) in order to learn an [Meg: can you fix the subscripts so they aren‟t so low?] optimal policy. Instead, the agent interacts with the The temperature, T, starts at a high value and is environment by executing actions and perceiving the decreased as learning goes on. At the beginning, the 2 high value of T ensures that some actions associated relationships from a set of specific data instances, hence with low Q-values are taken – encouraging exploration. the use of the term “inductive”. This contrasts with the As the temperature is decreased, most actions are those well-established field of deductive reasoning, in which a associated with high Q-values – encouraging conclusion is inferred from a given set of premises exploitation. This is the method of choosing actions that through the application of accepted logical rules. In we will use. most cases, ILP has been used to learn concepts so that a set of unseen examples may be classified correctly. P-learning Decision trees Like Q-learning, P-learning is a technique for choosing actions in a particular state, presented by [Džeroski et One popular method for performing the classification of al., 2001]. Indeed, P-learning is almost identical to Q- examples is the decision tree. A decision tree consists of learning, except that rather than remembering a real- a set of nodes, each of which is either an internal node valued scalar to represent the quality of each state- (meaning it has a set of other nodes as its children), or a action pair, the agent only remembers a 1 if the action in leaf node (in which case it has no children). the given state is optimal, and a 0 if the action is nonoptimal. Formally: Each internal node contains a test that is used to split the possible set of examples into several mutually exclusive if a π*(s) then P(s, a) = 1 else P(s, a) = 0. groups. The node then passes the current example down the branch corresponding to the result of the test. Often, [Meg: since you are talking about optimal actions here, the tests return Boolean results, so the tree is binary, shouldn‟t you be using P*? And, what‟s the rationale with each node classifying the example as either passing (throughout) for your use of boldface symbols?] or failing its particular test. Decision trees with this property are called binary decision trees. In practice, the P function is computed from the Q function (which must also be learned), but it can be Leaf nodes do not contain tests, but rather hold the stored more compactly and used later more easily. The classification label for examples that are sorted into that P function is expressed in terms of the Q function as node. Thus an example can be deterministically follows: classified by a decision tree. if a arg maxa Q(s, a) then P(s, a) = 1 else Logical decision trees P(s, a) = 0. According to [Blockeel and De Raedt, 1997], logical We can also incorporate exploration into P-learning decision trees are binary decision trees which fulfill the through the analogous equation to the one for choosing following two properties: actions in Q-learning: The tests in the internal nodes are first-order Pr (ai | s) = T-P (s, ai) / jT -P (s, a ) j . logical conjunctions of literals. A variable that is introduced in a particular [Fix subscripts] node (i.e. one which does not exist higher in the tree), can only occur in the subtree We will see that P-learning, although it is built directly corresponding to successful tests. on top of Q-learning and may seem rather redundant, produces smaller, simpler, and more efficient The second requirement is necessary due to the fact that representations of the optimal solution and therefore is if a test referring to a newly introduced variable fails, extremely valuable towards our goal of creating the agent cannot be sure that the variable refers to hierarchical programs. anything at all, and therefore it does not make sense to continue to refer to it. 3. Inductive logic programming and top- We illustrate the concept of logical decision trees with down induction of logical decision trees this simple example, taken from [Džeroski et al., 2001]: Inductive logic programming (ILP) refers to a subdiscipline of machine learning that deals with learning logic programs. ILP attempts to use reasoning to derive useful and general principles, rules, and 3 not(N.test) are not necessarily complementary for all clear(a) possible substitutions. This distinction does not exist in the propositional case. For a more detailed description of the logical inequivalence of the two queries, see false clear(b) ([Blockeel and De Raedt, 1997], p.4-6). [Meg: I didn‟t find the bulleted items and the paragraph false clear(c) that followed them very clear. As a matter of fact, I don‟t think I understand them at all. If you are assuming that the reader has read and understands false true Blockeel and De Raedt, this might be ok as a quick summary. But if you want your paper to be at all self- This logical decision tree determines whether all the contained (which I think would be a good idea), it blocks are unstacked in the 3-blocks world. would be important to explain all this. And an example of a tree with tests and queries would be important. (I [Meg: All of your other figures have numbers. How don‟t understand the difference between a test and a come this one doesn‟t?] query.] I‟d be glad to have you explain all this to me in person, and then maybe I could give some advice about The fact that we are using variables also necessitates the explaining it in writing.] use of more complicated tests than those used in the propositional case. Internal nodes must contain a query Thus logical decision trees allow the classification of including both the unique predicate introduced in the examples described in first order (rather than current node, and the predicates introduced in all the propositional or attribute value) logic, and therefore parent nodes of the current node. The following process offer much greater expressiveness and flexibility than for associating clauses and queries to nodes is adapted traditional decision trees. This characteristic makes from [Blockeel and De Raedt, 1997]. We assume that them ideal candidates for use in relational reinforcement examples sorted down the tree are passed to the left learning, which seeks to allow variablization and subtree if the node‟s test returns true, and to the right generalization over similar inputs. subtree if it returns false. [Meg: This convention is different than the one used in the figure above. Is that a Top-down induction of logical decision trees problem?] We also assume that N.test returns the unique predicate assigned to the node N: In order to use logical decision trees for classification, however, we must have an algorithm for inducing them With the root node T, the clause c0 ← T.test from a set of example instances. [Blockeel and De and the empty query are associated. Raedt, 1997] offer the following algorithm, based on the For every internal node N with associated classic TDIDT (top-down induction of decision trees) clause ci ← P and associated query Q, the algorithm, but modified to induce logical decision trees. query P is associated with its left subtree. The It assumes as input a set of labeled examples E, query Q, not(ci) is associated with the right background knowledge B, and a language L specifying subtree. what kind of tests may be used in the decision tree. Both With each internal node N with associated B and L are assumed to be available implicitly. query Q, the clause ci ← Q, N.test is associated; i is a number unique for this node. buildtree (T, E, Q): Each leaf node N has no associated test, but if E is homogeneous enough only the query Q that it inherited from its then parent. K := most_frequent_class (E) T := leaf (K) It is important to note that we associate the query Q, else not(ci) with the right subtree, and not the expected query Qb := best element of ρ(Q), according to Q, not(N.test), which would be symmetric with the left some heuristic subtree. This is necessary because the queries of the left T.test := C‟ where Qb ← Q, C’ and right subtrees must be complementary; only one of the two queries associated with the two children of each E1 := { E E | E U B |= Qb } node must succeed on any example passing through that E2 := { E E | E U B |≠ Qb } node. Because it is possible that the N.test shares buildtree (T.left, E1, Q’ ) variables with Q, the two queries Q, N.test and Q, buildtree (T.right, E2, Q ) 4 [Replace quotes by primes in above (and Thus relational reinforcement learning offers the throughout)] potential to overcome many of the serious drawbacks The refinement operator ρ determines the language bias that traditional reinforcement learning has presented in (and thus serves the role of the language L). [Blockeel the past. et. al., 2001] use a refinement operator that adds one or more literals to the clause under θ-subsumption. A Modifications to traditional Q- learning clause c1 subsumes another clause c2 if and only if there is a substitution θ such that c1 θ c2. In other words, a Relational reinforcement learning as a technique is a refinement operator under θ-subsumption maps a clause more specific version of its traditional Q-learning roots. onto a set of clauses, each of which is subsumed by the As in reinforcement learning, the algorithm has initial clause. It is from this set of new clauses that the available to it: conjunction to be put in a new node is chosen. a set of possible states, S, [Can you think of a short example to illustrate how the a set of agent actions, A, algorithm works?] a transition function which, although it may not be explicitly known by the algorithm, will The TILDE system, which [Blockeel and De Raedt, return the next state given the current state and 1997], [Blockeel, 1998], and [Blockeel et. al., 2001] an action to perform. introduce and describe, implements the algorithm a reward function that returns scalar above. It can be used to induce both classification and reinforcement signals given a state. regression logical decision trees, using the equivalent regression tree system TILDE-RT. (Regression trees are In relational reinforcement learning, however, more identical in definition to classification trees, except that specific rules govern the learning process. States are they contain real-valued numbers in their leaves rather represented as sets of first-order logical facts, and the than classifier labels.) We will see how logical decision algorithm can only see one state at a time. Actions are trees, induced with the TILDE system, can be used to also represented relationally. [What does it mean to store Q and P-values, while introducing variables and represent an action relationally? Does it just mean that allowing for generalization in the reinforcement the action (schema) has variables?] Because not all learning process. actions are possible in all states, the algorithm can only consider actions that are possible given the current state, 4. Relational reinforcement learning and it must be able to determine from the state description whether a given action is possible. Also, As described in section 2, traditional reinforcement because of the relational representation of states and learning stores the learned Q-values in an explicit state- actions and the inductive logic programming component action table, with one value for each possible of the algorithm, there must exist some body of combination of the state and action. The relational background knowledge which is generally true for the reinforcement learning method introduced by [Džeroski entire domain, as well as a language bias determining et al., 1998] and [Džeroski et al., 2001] modifies this how the learned policy will be represented. aspect of reinforcement learning in order to introduce variables and to increase the ability of the agent to As in traditional reinforcement learning, the agent then generalize learned knowledge to unseen parts of the seeks to find an optimal policy π*, mapping states to state space. Rather than using an explicit state-action actions, that maximizes some long-run measure of table, relational reinforcement learning stores the Q- reinforcement. values in a logical regression tree. There are several important advantages to this approach: Perhaps the most important modification, however, is the method of storage of the learned Q-values. In It allows for larger or even infinite state spaces, relational reinforcement learning, sets of state-action or state spaces of unknown size. pairs and their corresponding Q-values are used to By using a relational representation, it takes induce a logical regression tree after each training advantage of the structural aspects of the task. session. This tree then provides the Q-values for state- It provides a way to apply the learned solution action pairs for the following training session. It also of one problem to other similar or more represents the policy which is being learned. In the complicated problems. version of relational reinforcement learning corresponding to P-learning, the tree is induced with sets of state-action pairs and their corresponding label 5 (optimal or nonoptimal), and the resulting tree is a then translated into example format for input into the logical classification tree mapping state-action pairs to TILDE-RT system. Figure 3 illustrates the Q-tree labels. induced with the given sequence of examples as input. (These examples, as well as Figures 4 and 5, are taken The Q relational reinforcement learning algorithm from [Džeroski et al., 1998]). The following algorithm, developed by [Džeroski et al., move(c,table) move(b,c) move(a,b) move(a,table) 1998] and [Džeroski et al., 2001], is essentially identical r=0 r=0 r=1 r=0 to a traditional Q-learning algorithm, except in its Q = 0.81 Q = 0.9 Q=1 Q=0 method of updating the Q-function, Qe (represented as a regression tree after an episode e). c a Q-RRL Initialize Q0 to assign 0 to all (s, a) pairs. b b b b Initialize Examples to the empty set. a a a c a c a c e := 0 a a a a do forever [The letters on the blocks got squished.] e := e + 1 a a a a a a i := 0 Figure 1: A schematic representation illustrating one possible generate random state s0 training session while using relational reinforcement learning while not goal (si) do in the blocks world to accomplish the ordered stacking goal. select an action ai stochastically using the Q-exploration strategy qvalue(0.81). qvalue(1.0). using the current hypothesis for Qe action(move(c,table)). action(move(a,b)). perform action ai goal(on(a,b)). goal(on(a,b)). goal(on(b,c)). goal(on(b,c)). receive immediate reward ri = r (si, ai) goal(on(c,table)). goal(on(c,table)). observe the new state si + 1 clear(c). clear(a). i := i + 1 on(c,b). clear(b). endwhile on(b,a). on(b,c). on(a,table). on(a,table). for j = i – 1 to 0 do on(c,table). generate example x = (sj, aj, qj), qvalue(0.9). where qj := rj + maxa’ Qe (sj+1, a’) action(move(b,c)). qvalue(0.0). goal(on(a,b)). action(move(a,table)). if an example (sj, aj, qold) exists in goal(on(b,c)). goal(on(a,b)). Examples, replace it with x, else add goal(on(c,table)). goal(on(b,c)). x to Examples clear(b). goal(on(c,table)). update Qe using TILDE-RT to produce clear(c). clear(a). on(b,a). on(a,b). Qe +1 using Examples on(a,table). on(b,c). Figure 2: The state-action-Q-value sets, from the sequence in on(c,table). on(c,table). The algorithm uses the TILDE-RT system (described in Figure 1, given as examples in relational format. This is part 3) to induce the Q-tree. Because TILDE-RT is not exactly the input that would be given to TILDE-RT. incremental, a set of all examples encountered over all training sessions is maintained, with the most recent Q- goal_on(A,B),goal_on(B,C),goal_on(C,table), numberofblocks(D) value for each one, and used to reinduce the entire tree on(A,B) ? after each episode. +--yes: [0.0] +--no: clear(A) ? An example for a state-action-Q-value set is represented +--yes: [1.0] +--no: clear(B) ? as a set of relational facts, including all the facts of the +--yes: [0.9] state itself, the relational form of the action, and the Q- +--no: [0.81] value. State-action-Q-value sets are translated into examples in this format in order to induce the tree, and [Three comments about the above figure: You should then a state-action pair can be translated into example explain what the goal_on( . ) formalism means and how format in order to retrieve the corresponding Q-value it is used. (Is it different than goal(on( , ))?) And, it from the tree during a training session. Figure 1 seems there are some lines missing from the figure? demonstrates a sequence of state-action-Q-value sets Also, wouldn‟t this be a good place to mention the encountered during a training session in the blocks Prolog convention of capital letters for variables and l.c. world domain, and Figure 2 shows how these sets are for constants? 6 ] else generate example (sj, ak, c) Figure 3: The Q-tree induced from the examples in Figure 2. where c = 0 update Pe using TILDE to produce Pe + 1 using If the state is a goal state, the Q-value is automatically 0 these examples (sj, ak, c) regardless of the action; goal states are absorbing, so all actions that would cause the agent to leave the goal state Again, all actions from a goal state are assigned a label must be considered nonoptimal (and thus have an of nonoptimal, for the reason discussed above. absolute quality of 0). The same sequence of state-action-Q-value sets from From Figure 3, it is apparent that the logical regression Figure 1 is given in Figure 4, but in the format that tree used in relational reinforcement learning differs would be given as input to the TILDE system in order to slightly from the definition of logical decision trees induce the logical classification P-tree. The only given in Part 3, in that it contains a root node with facts difference is that the classifier label “optimal” or that are guaranteed to be true for all possible examples. “nonoptimal” is given, rather than a Q-value. Figure 5 [Which is the root node in Fig. 3?] This is necessary in illustrates the corresponding P-tree. order to bind the variables of the tree to the appropriate constants, and this binding is then propagated to all optimal. optimal. nodes of the tree. Because we can bind the variables in action(move(c,table)). action(move(a,b)). the root node to any set of appropriate constants, the goal(on(a,b)). goal(on(a,b)). learned tree can be used as a solution to any problem of goal(on(b,c)). goal(on(b,c)). goal(on(c,table)). goal(on(c,table)). the same form, regardless of the actual names of the clear(c). clear(a). blocks. This is one of the most significant on(c,b). clear(b). improvements of the relational reinforcement learning on(b,a). on(b,c). on(a,table). on(a,table). over traditional Q-learning. on(c,table). optimal. The P relational reinforcement learning algorithm action(move(b,c)). nonoptimal. goal(on(a,b)). action(move(a,table)). goal(on(b,c)). goal(on(a,b)). Just as traditional P-learning is almost identical to Q- goal(on(c,table)). goal(on(b,c)). learning, the P-learning counterpart to relational clear(b). goal(on(c,table)). reinforcement learning introduces very few distinctions. clear(c). clear(a). Because our definition of the P-function depends upon on(b,a). on(a,b). on(a,table). on(b,c). the Q-function, we must learn both a Q and P-tree. Figure 4: The state-action-P-value sets, from the sequence in on(c,table). on(c,table). However, the P-tree will in most cases prove to be Figure 1, given as examples in relational format. It is identical simpler, and thus a more efficient method for choosing to the input in Figure 2, but with classifier labels rather than actions. The Q-tree is only learned in order to define the Q-values. This is exactly the input that would be given to TILDE. P-tree; it is not used to select actions. [It isn‟t exactly clear how the P tree is learned. I take it goal_on(A,B),goal_on(B,C),goal_on(C,table), numberofblocks(D) that a Q tree is learned first, and then it is used to on(A,B) ? construct training samples for the P-tree, which is then +--yes: [nonoptimal] learned from the samples by TILDE?] +--no: [optimal] [Is there something missing from the above figure? The algorithm is identical except for three differences: Where is the action mentioned?] the initial P-tree assigns a label of “optimal” to all state- action pairs, actions are chosen using the P-exploration Figure 5: The P-tree induced from the examples in Figure 3. strategy given in part 2, rather than the Q-exploration strategy, and the following loop is added after the Q- 5. Hierarchical relational reinforcement function update loop, to update the P-function. Again, this psuedocode was taken from [Džeroski et al., 2001]: learning for j = i – 1 to 0 do [Džeroski et al., 2001] thoroughly describe the method for all actions ak possible in state sj do of relational reinforcement learning as described in part if state-action pair (sj, ak) is optimal 4. They also offer a set of thorough experimental results according to Qe + 1 testing the effectiveness of the method at planning in the then generate example (sj, ak, c) blocks world domain. One of their most significant where c = 1 findings is that relational reinforcement learning works 7 extremely well on very simple goals, but becomes less background knowledge. In a more advanced effective at solving even slightly more complex implementation, the agent may learn what subgoals to problems. For the goals of unstacking all the blocks use, as well as the solutions to the subgoals themselves. onto the table, or building all the blocks into a simple tower (in any order), the method gives very impressive Each subroutine policy is learned independently using convergence rates to an optimal solution. However, it relational reinforcement learning, which produces a converges more slowly with the goal of placing one (presumably optimal) Q or P-tree. The agent that is specific block on top of another specific block. [I don‟t learning the larger goal then creates instances of each understand why putting one block on top of another subroutine by instantiating the variablized root nodes of should be harder than building a tower. Is it because each tree once for each possible binding. In our blocks that specific blocks are named in the former task?] world domain, for example, if the subroutine is makeClear(A) and there are three blocks in the world Our implementation of relational reinforcement learning named a, b, and c, the agent would create three seems to support this conclusion. We attempted to learn instances: a more complicated goal: building all the blocks in the world into a single ordered tower. [Shouldn‟t you say makeClear(a) that “all the blocks in the world” amounted to only 3 or makeClear(b) 4 blocks?] We found great difficulty producing optimal makeClear(c). solutions, or even improving the accuracy rate while learning. Part 6 of this paper of gives the results of these [I‟m beginning to worry a bit about the method. If we preliminary experiments, and offers some discussion. used all instances of each subroutine, that would amount to a lot of specific actions to consider in big This relative lack of success in solving more complex problems. Maybe we can talk a bit about how to get problems suggests that the benefits of hierarchical around this.] learning in a relational reinforcement learning context could be quite profound. By learning first a set of useful (We use the Prolog convention, which uses uppercase routines for accomplishing subgoals, and then using characters for variables and lowercase characters for these subroutines as the available action set to learn the constants. [This convention should be mentioned original goal, the agent could significantly reduce the earlier.]) For a subroutine makeOn(A,B) with three number of training sessions required to converge upon blocks, the agent would create nine instances: an optimal policy, and in the end could produce a simpler representation of the optimal policy. Figure 6 makeOn(a,b) illustrates schematically how this idea could be makeOn(a,c) approached in the blocks world domain. In order to test makeOn(a,table) this hypothesis, we have developed a simple makeOn(b,a) implementation strategy for hierarchical relational makeOn(b,c) reinforcement learning. makeOn(b,table) makeOn(c,a) makeOn(c,b) Main Goal: makeOn(c,table). orderedStack(a,b,c) For a subroutine with no arguments, such as unstack, a single instantiation is created. Subgoal: Subgoal: Subgoal: clear(A) on(A,B) unstack Note that the instantiation of subroutines is only possible because relational reinforcement learning Figure 6: The main goal in the ordered stacking task can be produces variablized trees, so one solution can be used accomplished by achieving some combination of subgoals in a to solve many problems. Past attempts at hierarchical particular order. It is this structure of a learning problem that reinforcement learning have been slowed without this hierarchical learning hopes to exploit. feature. An implementation of hierarchical relational After instantiating all subroutines, the agent then uses reinforcement learning the available subroutines as its set of possible actions while learning a policy for achieving the main goal. We In this simple implementation, the choice of subgoals define the preconditions of a subroutine to be simply that will be learned is supplied to the agent as a form of that its subgoal is not already accomplished in the given 8 state. Thus subroutines have the power to be much more another action out of state si would be chosen is if there effective than simple primitive actions, as they can exists a shorter path to the goal that does not pass make a particular subgoal true, in the fewest number of through si + 3. steps, starting in any state. Using this modified Q update function, we can run the In the Q-learning process, subroutines, like primitives, relational reinforcement learning algorithm as usual, but are assigned Q values that reflect the quality of taking using subroutines to improve the efficiency of the that action in a particular state. (The P-function is learning. defined as usual using the Q-function). However, the original Q update function proves to be inaccurate when 6. Experiments used with subroutines. Consider the example depicted in Figure 7. In order to test the hypothesis that hierarchical relational reinforcement learning overcomes some of the subroutine a ineffectiveness of relational reinforcement learning on more complicated goals, we conducted a set of experiments aimed at comparing the accuracy over time [0.81] of the learned regular policies and the learned si si+1 si+2 si+3 si+4 hierarchical policies. (primitive actions taken Experimental setup by subroutine a) Figure 7: A possible sequence of states visited by the agent We conducted two sets of experiments. In the each set during a training session. The actions explicitly taken by the we performed 10 total runs of 100 training sessions each agent are designated by the bold arrows, while the primitive in the 3-blocks world, 5 runs using very limited actions taken by the subroutine are shown with dotted arrows. background knowledge, and 5 runs using a more extended body of background knowledge. In the first set Usually the state action pair (si, a) would be assigned a of experiments, we did not use hierarchical learning, new Q-value as follows: and in the second set we did. [NOTE: Some runs take over 10 hours to complete. Due to time constraints, newQ = rj + maxa’ Qe (sj + 3, a’), experiments were only conducted in the 3-blocks world. The results from experiments conducted in the 4 and 5- where maxa’ Qe (sj+3, a’) in this case is 0.81 and rj is 0, blocks world are forthcoming.] [You might comment so newQ would be 0.72 (with a discount factor of 0.9). here that your interfacing of code from others with your But this is misleading, because it implies that si is much java code was the main reason the runs took a long time closer to the goal than it actually is. In later training to complete.] sessions, the agent will be unfairly biased towards choosing a path through state si. The current state was described using the following primitive predicates: In order to preserve the accuracy of the Q-function, we remember the number of primitive moves taken by the clear(A): true if block A has no other blocks on subroutine from state si, say n, to its next state si + 3, and top of it we calculate the newQ with the following modified on(A,B): true if block A is directly on top of B. update equation: B may be a block or the table. newQ = rj + n maxa’ Qe (sj + 3, a’), The goal task was to stack a set of blocks in an ordered tower. Specifically, the optimal solution for each where n is the number of primitive moves taken by the number of blocks was: subroutine just completed. Because ordinary primitive moves trivially contain only one primitive (n = 1), the 3 blocks: on(a,b), on(b,c), on(c,table). modified function works exactly the same as the 4 blocks: on(a,b), on(b,c), on(c,d), on(d,table). original function for the case of primitive actions. 5 blocks: on(a,b), on(b,c), on(c,d), on(d,e), Since we assume the subroutines have already been on(e,table). learned to be optimal, this is effectively a shortcut on the learning path. Because the subroutine is definitely For the non-hierarchical learning experiments, the the shortest path between states si and si + 3, the only way available action set consisted of all possible forms of the 9 primitive move(A,B), which may only be executed if where f = (total number of training sessions in the both A and B are clear. The second argument can be the current run) / 10. We also used the TILDE and TILDE- table. RT systems, as part of the ACE Data Mining System, [Blockeel et al., 2001], provided by Hendrik Blockeel at For the hierarchical learning experiments, we first Katholieke Universiteit Leuven, Belgium. learned optimal policies for the following subroutines: Experiments conducted with a limited body of unstack: goal reached if all blocks in the world background knowledge provided to TILDE and TILDE- are on the table RT used only the following pieces of informations: makeOn(A,B): goal reached if block A is on block B eq(X,Y): returns true if blocks X and Y are the same block The subroutine makeClear(A), although it is useful in block(X): a variable is considered a block (as achieving the goal of an ordered stack, is itself a opposed to the table) if and only if it is found subroutine of makeOn(A,B), and thus was omitted. [I as the first argument in some fact on(X,Y). don‟t understand---did you use a 3-level hierarchy?] Figures 8 and 9 give the optimal learned P-trees used for Experiments conducted with the more extended body of each subroutine. The instantiations of each of these background knowledge used essentially the same subroutines then became the possible action set for information used by [Džeroski et al., 2001]. These accomplishing the goal task above. included: goal_unstack,numberofblocks(A) above(X,Y): returns true if block X is above action_move(B,table) ? block Y in some stack. +--yes: [optimal] +--no: [nonoptimal] height(X,N): returns true if block X is at height N above the table. Figure 8: An optimal P-tree for the subroutine unstack, numberofblocks(N): returns true if there are N learned after 10 training sessions. blocks in the world. numberofstacks(N): returns true if there are N goal_on(A),numberofblocks(B) stacks in the current state. action_move(D,table) ? +--yes: clear(A) ? diff(X,Y,Z): returns true if X – Y = Z. | +--yes: clear(B) ? | | +--yes: [nonoptimal] Note that this information only affects the learning | | +--no: on(A,B) ? | | +--yes: on(B,table) ? process in that it is made available to TILDE during the | | | tree induction process. It does not affect the way states +--yes: [nonoptimal] | | | +--no: [optimal] are represented or how actions are carried out. | | +--no: [optimal] | +--no: [optimal] +--no: action_move(A,B) ? Experimental results +--yes: [optimal] +--no: [nonoptimal] [Should goal_on(A) in the above be goal_on(A,C) or The results of the 3-block runs with limited background something? And shouldn‟t there be more to this knowledge, and extended background knowledge are figure?] given in Figures 10 and 11, respectively. The accuracy of the learned P function for hierarchical and non- hierarchical learning is plotted over time. The accuracy Figure 9: An optimal P-tree for the subroutine makeOn(A,B), is defined as the percentage of state-action pairs that are learned after 51 training sessions. correctly classified as optimal or nonoptimal using the learned P function. The accuracy of all 5 runs of Our implementation uses the equations and algorithms hierarchical learning is averaged and plotted after each described in parts 2–5 of this paper. We used a discount training session. The equivalent representation for non- factor of 0.9 for the Q update function and a starting hierarchical learning is given on the same graph in order temperature T of 5 for the Q and P exploration to illustrate the contrast. functions. We used the following equation as a schedule for temperature decrease, given a factor f and a current Error! Not a valid link.Figure 10: Hierarchical and non- training session number e: hierarchical learning in the 3-blocks world, with limited background knowledge. T (f, e) = f / (e + f), 10 Error! Not a valid link. Figure 11: Hierarchicalgoal_on(A,table),goal_on(B,A),goal_on(C,B) and non- extended hierarchical learning in the 3-blocks world, with action_makeOn(A,D) ? background knowledge. +--yes: [nonoptimal] +--no: clear(B) ? [Whoops---no figures] +--yes: on(A,C) ? | +--yes: [nonoptimal] | +--no: action_makeOn(B,A) ? [NOTE: Again, due to time constraints, only one| +--yes: clear(A) ? execution of the hierarchical learning with extended| | +--yes: [optimal] background knowledge was completed. The accuracy of | | +--no: [nonoptimal] | +--no: action_makeOn(C,table) ? this one run is plotted against the average accuracy of| +--yes: clear(C) ? the non-hierarchical case. More statistically sound| | +--yes: [optimal] results are forthcoming.] | | +--no: [nonoptimal] | +--no: on(B,A) ? | +--yes: action_makeOn(C,B)? The first observation we can make from the results in| | +--yes:[optimal] both Figures 10 and 11 is that relational reinforcement| | +--no: [nonoptimal] learning, without hierarchical learning, makes very little| +--no: [nonoptimal] action_makeOn(B,A) increase in accuracy over time. Whether or not extended+--no: +--yes: on(C,table) ? ? background knowledge is used, the average accuracy of | +--yes: [nonoptimal] the learning program seems to hover around 80% | +--no: [optimal] accuracy. This contributes more evidence to the +--no: on(C,table) ? +--yes: action_makeOn(C,A) ? conclusion drawn by [Džeroski et al., 2001], that | +--yes: [nonoptimal] relational reinforcement learning becomes much less | +--no: [optimal] effective when used on slightly harder problems. +--no: on(A,table) ? +--yes: action_makeOn(C,R) ? | +--yes: on(B,R) ? Second, From Figure 10, we see that hierarchical | | +--yes: [optimal] learning, with limited background knowledge, contrary [Adjust figure to fit in column.] | | +--no: [nonoptimal] to our hypothesis, in fact makes very little improvement | +--no: [nonoptimal] +--no: [nonoptimal] Figure 13: An optimal P-tree for ordered stacking, learned with respect to the non-hierarchical version. The after 45 training sessions with hierarchical learning and individual runs in both cases do at times hit 100% limited background knowledge. accuracy, but the averaged results displayed on the graph demonstrate that this does not occur consistently. Perhaps, however, this lack of improvement is due to However, we can give examples of optimal P-trees for the limited background knowledge available. We turn to both hierarchical and non-hierarchical learning Figure 11, the results of the preliminary experiments produced at these inconsistent occasions; these are conducted with extended background knowledge, to shown in Figures 12 and 13. determine if this is true. goal_on(A,table),goal_on(B,A),goal_on(C,B) While the results of the hierarchical run illustrated in action_move(D,table) ? +--yes: eq(D,B) ? Figure 11 is by no means statistically sound, it suggests | +--yes: clear(C) ? that the results of more thorough tests would be very | | +--yes: [nonoptimal] similar to the hierarchical learning performed with | | +--no: clear(A) ? limited background knowledge. In both cases, the | | +--yes: [nonoptimal] | | +--no: [optimal] hierarchical learning does not seem to show significant | +--no: [optimal] improvement over non-hierarchical learning. [It‟s +--no: action_move(B,A) ? possible (probable?) that the lack of improvement is due +--yes: on(A,C) ? to the fact that the problem is too small for hierarchical | +--yes: [nonoptimal] | +--no: [optimal] learning to be needed.] +--no: on(B,A) ? +--yes: action_move(C,B) ? 7. Conclusions and further work | +--yes: [optimal] | +--no: [nonoptimal] [Adjust figure to fit in column.] +--no: [nonoptimal] Naturally the first and most important avenue for further work will be to complete the experimental setup Figure 12: An optimal P-tree for ordered stacking, learned described here, first for the remainder of the 3-block after 24 training sessions with non-hierarchical learning and hierarchical runs with extended background knowledge, limited background knowledge. and then to repeat the entire set of experiments in the 4 and 5-blocks world. It is possible that as the state space and the complexity of the problem increases, the 11 contrast between hierarchical and non-hierarchical acknowledgements are due to Hendrik Blockeel and accuracy will become more pronounced. In addition, it Kurt Driessens at Katholieke Universiteit Leuven for is possible to observe a very slight but perceptible their patience in answering my questions regarding the increase in the average accuracy of all the learning TILDE system, and for making the system available to methods over the course of the 100 training sessions. me for this project. With enough time and computing power, it would be interesting to learn how the graph appeared after 500 or References 1000 training sessions. [Blockeel, 1998] Blockeel, H. Top-down Induction Of However, it may be the case that our simple strategy for First Order Logical Decision Trees. Department of hierarchical learning is not sophisticated enough to Computer Science, Katholieke Universiteit Leuven, significantly improve the overall performance of the Belgium, December 1998. relational reinforcement learning algorithm. This possibility warrants further research into more [Blockeel et al., 2001] Blockeel, H., Dehaspe, L., and sophisticated hierarchical methods: using recursion, Ramon, J. “The ACE Data Mining System: User‟s creating nested hierarchies, taking greater advantage of Manual.” Department of Computer Science, the structure of the problem, or working more closely Katholieke Universiteit Leuven, Belgium, October with the available background knowledge. 2001. Finally, further work into ways of improving relational [Blockeel and De Raedt, 1997] Blockeel, H. and De reinforcement learning in general would be fascinating. Raedt, L. “Top-down Induction of Logical Decision For example, as stated above, the program at times Trees.” Department of Computer Science, Katholieke stumbles upon a completely optimal Q or P-tree, and Universiteit Leuven, Belgium, January 1997. then, due to a high temperature value, wanders away from the optimal solution. [De Raedt and Blockeel, 1997] De Raedt, L. and [I‟ll bet that the annealing schedule needs to be Blockeel, H. “Using Logical Decision Trees For accelerated. That is, the temperature ought to be Clustering.” Proceedings of the Seventh International decreased more rapidly.] It would be interesting to Workshop on Inductive Logic Programming, 133-141, investigate ways of introducing a type of evaluation 1997. function, similar to that used in genetic programming, which would greatly decrease the temperature or even [Džeroski et al., 1998] Džeroski, S., De Raedt, L., and halt the program when an optimal or near-optimal Blockeel, H. “Relational Reinforcement Learning.” solution is found, even if it is early in the run. Proceedings of the Fifteenth International Conference on Machine Learning, 136-143. Morgan Kaufmann, Relational reinforcement learning offers great potential: 1998. it combines two extremely powerful machine learning techniques, and uses each one to overcome the other‟s [Džeroski et al., 2001] Džeroski, S., De Raedt, L., and drawbacks. Similarly, a hierarchical approach to the Driessens, K. “Relational Reinforcement Learning.” learning problem appears as though it would exploit the Machine Learning, 43:7-52. The Netherlands: Kluwer benefits and avoid the weaknesses of relational Academic Publishers, 2001. reinforcement learning. Despite the rather unpromising results of the simple and small-scale experiments [Finney et al., 2002] Finney, S., Gardiol, N., Kaelbling, conducted here, I believe that relational reinforcement L. P., and Oates, T. “Learning with Deictic learning warrants a great more attention and thought, if Representation.” Massachusetts Institute of only to discover more concretely its specific strengths Technology, Artificial Intelligence Laboratory, and weaknesses. Cambridge, MA, April 2002. Acknowledgements [Kaelbling et al., 1996] Kaelbling, L. P., Littman, M., and Moore, A. “Reinforcement Learning: A Survey.” This work has been supported by the Computer Science Journal of Artificial Intelligence Research, 4:237-285, Undergraduate Research Internship program at Stanford 1996. University. A special thanks to Professor Nils Nilsson for his boundless support, encouragement, and [Nilsson, 1996] Nilsson, N. J. Introduction To Machine enthusiasm, and to my research colleagues Mykel Learning. Department of Computer Science, Stanford Kochenderfer and Praveen Srinivasan for their valuable University, CA, September, 1996. Unpublished. commentary on my work. Also, grateful 12