VIEWS: 313 PAGES: 137 CATEGORY: Engineering POSTED ON: 12/17/2009
Robot Learning is currently a very active research area, as it promises to create more flexible, independent and autonomous robots. It also opens the possibility for non-experts to instruct a robot how to perform different tasks without the need for programming. In this short tutorial we review some of the main current techniques used in robot learning and show some of their applications. In particular, we focus on reinforcement learning and programming by demonstration, but we also review some techniques used in visual concept learning and other machine learning techniques that can be used in robotics.
Robot Learning
Eduardo Morales, INAOE – Puebla, emorales@inaoep.mx

Contents
• Motivation
• Most common tasks and ML techniques
• Conclusions and current challenges

Motivation
• There is an increasing use of service robots for different tasks (with a tendency towards humanoids)

What to expect?
• Service robots will become at least as important as industrial robots
• The service robot industry will become as important as the car industry
• They will become as common as laptops are today
• Huge investment in countries like Japan, Korea, the European Union and the USA
• Expected growth

Why Robot Learning?
• In order to become widely accepted, robots need to be flexible, autonomous, and capable of adapting to their circumstances, and Machine Learning may be the only way to achieve this
• Machine Learning can also ease the burden of programming and allows robots to be instructed by non-experts

Challenges
• An adequate use/interpretation of visual, tactile, range, …, stimuli and an adequate selection of actions are necessary
• We need to estimate task parameters, adequately use primitives, recognize goals, …
• We also want learning to be fast (preferably online), smooth, able to recover from errors, able to deal with few samples and/or large amounts of data, not too sensitive to errors in the demonstrations, …

Robot Learning
• Robot learning can occur at different levels (from a direct mapping between perception and action up to planning and learning of complex tasks) and with different ML strategies (supervised, semi-supervised, unsupervised, reinforcement)

Most Common ML Techniques
• Reinforcement learning with programming by demonstration (or behavioural cloning)
  – AAAI-2010 Learning by Demonstration Challenge
• Visual concept learning
  – AAAI-2010 Semantic Robot Vision Challenge
• Other useful techniques: active learning, hierarchical learning, different regression techniques, clustering, …

Robot Tasks
• Navigation
• Interaction
• Recognition
• Grasping and manipulation
• Combined skills
• …
Reinforcement Learning
In reinforcement learning an autonomous agent follows a trial-and-error process to learn the optimal action to perform in each state in order to reach its goals.

Markov Decision Process
An MDP is described by:
• A finite set of states, S
• A finite set of actions per state, A
• A transition model, P(s'|s,a)
• A reward function for each state, r(s), or state-action pair, r(s,a)

Transition Model
• There is uncertainty in the outcome of an action
• The transition model gives the probability of reaching state s' when action a is performed in state s: P(s'|s,a)
• The transition probabilities depend only on the current state (Markovian) and are stationary
• With uncertainty in the actions we have an MDP; with uncertainty also in the states we have a POMDP

Reward
• Example reward function: a grid world where most states have reward –1/25, the goal state +1 and a trap state –1

Finite vs. Infinite Horizon
• Finite horizon: a finite number of steps
• Infinite horizon: an undetermined number of steps
• In many practical problems, and in most robotic tasks, an infinite horizon is considered; in this case the accumulated reward is evaluated as the discounted sum E[ Σ_t γ^t r_t ], with discount factor 0 ≤ γ < 1

Value Functions
• The value function (V or Q) is the expected value of the accumulated reward obtained when following a particular policy, under a finite or an infinite horizon

Utility
• Considering a separable (additive) utility, the utility of a state can be obtained from its current reward plus the (discounted) utility of the next state

Optimal Value Functions
• Bellman equations: V*(s) = r(s) + γ max_a Σ_s' P(s'|s,a) V*(s')

Optimal Policy
• Given a transition model, the objective is to find an optimal policy that maximizes the expected utility
• The policy gives the best action to apply in each state

Solution Techniques
There are two basic types of algorithms:
• Dynamic programming techniques: assume a known model (transition and reward functions) and solve it to obtain an optimal policy (value iteration, policy iteration, linear programming)
• Monte Carlo and reinforcement
learning: the model is not known and the solution is obtained by exploring the environment

Exploration vs. Exploitation
• When selecting actions without knowing the model, there are two conflicting objectives:
  – Obtain immediate rewards from known good states (exploitation)
  – Learn new state/action/reward relations by taking new actions that lead to new states (exploration)
• There must be a balance between immediate and long-term rewards
• There is normally a gradual shift from exploration towards exploitation

Action selection strategies
• є-greedy: most of the time selects the action that produces the greatest benefit, but with probability є selects a random action
• Softmax: changes the selection probabilities according to the estimated values, the most popular being the Boltzmann distribution

Temporal Difference Methods
• TD(0) is the simplest for value functions: V(s) ← V(s) + α[r + γV(s') – V(s)]
• SARSA for Q-values: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') – Q(s,a)]
• Q-learning for Q-values: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') – Q(s,a)]

TD(0)
Initialize V(s) arbitrarily
Repeat (for each episode):
  Initialize s
  Repeat (for each step in the episode):
    a ← action given by π for s
    Take action a; observe reward r and next state s'
    V(s) ← V(s) + α[r + γV(s') – V(s)]
    s ← s'
  Until s is terminal

SARSA
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
  Initialize s
  Select a for s using a policy derived from Q (є-greedy)
  Repeat (for each step in the episode):
    Take action a; observe r and s'
    Select a' for s' using a policy derived from Q (є-greedy)
    Q(s,a) ← Q(s,a) + α[r + γQ(s',a') – Q(s,a)]
    s ← s'; a ← a'
  Until s is terminal

Q-Learning
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
  Initialize s
  Repeat (for each step in the episode):
    Select a for s using a policy derived from Q (є-greedy)
    Take action a; observe r and s'
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') – Q(s,a)]
    s ← s'
  Until s is terminal

Now suppose we want to learn how to fly … from our interactions with the environment …
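As a concrete illustration of the tabular updates above, here is a minimal Q-learning sketch on an invented toy corridor task; the environment, constants and function names are our own assumptions for illustration, not part of the tutorial:

```python
import random

def q_learning(n_states=6, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor: action 1 moves right,
    action 0 moves left, and reaching the last state pays reward +1."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        for _ in range(200):            # step cap per episode
            if rng.random() < epsilon:  # epsilon-greedy selection
                a = rng.randrange(n_actions)
            else:                       # greedy with random tie-breaking
                best = max(Q[s])
                a = rng.choice([i for i in range(n_actions) if Q[s][i] == best])
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == goal else 0.0
            # Q-learning update: bootstrap from the greedy value of s'
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
            if s == goal:
                break
    return Q

Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(5)]
print(greedy)  # the learned greedy policy moves right in every state
```

Because the update bootstraps from max_a' Q(s',a') rather than from the action actually taken, this is off-policy: the same loop with the SARSA update from the slide would instead learn the value of the є-greedy behaviour policy.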
• Challenges: a large number of continuous variables, an "infinite" state space, and large areas without a clear reward

RL Problems
• Long training times
• Continuous state and action spaces
• Non-transferable policies
• Some proposed solutions: update several states at a time, learn and use a model, state/action abstractions, hierarchies, reward shaping, function approximation, provide human traces, …

Eligibility Traces
• Eligibility traces lie between Monte Carlo and TD methods, as they consider several subsequent rewards (or, equivalently, affect several previously visited states)
• Simple TD uses a single reward before bootstrapping; we can easily consider two (or more) known rewards
• In practice, instead of waiting n steps to update (forward view), a list of visited states is stored and updated (backward view), discounted by distance
• TD(λ) and SARSA(λ) follow this scheme

Methods with/without models
• Model-free methods learn a policy without trying to estimate the expected rewards and the transition probabilities (e.g., Q-learning)
• Model-based methods try to learn an explicit model and derive a policy from it (they converge faster but require more memory)

Planning and Learning
• Given a model, it can be used to predict the next state and its associated reward
• The prediction can be a set of states with their rewards and probabilities of occurrence
• With the model we can plan and learn!
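Going back to eligibility traces, the backward-view update described above can be sketched as a minimal SARSA(λ) on an invented toy corridor; all environment details, names and constants here are illustrative assumptions, not from the tutorial:

```python
import random

def sarsa_lambda(n_states=6, n_actions=2, episodes=400, alpha=0.3,
                 gamma=0.9, lam=0.8, epsilon=0.1, seed=1):
    """Backward-view SARSA(lambda) on a toy corridor (action 1 = right,
    reward +1 at the last state): each TD error is spread over recently
    visited state-action pairs through a decaying eligibility trace."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    goal = n_states - 1

    def choose(s):
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        best = max(Q[s])
        return rng.choice([i for i in range(n_actions) if Q[s][i] == best])

    for _ in range(episodes):
        e = [[0.0] * n_actions for _ in range(n_states)]  # eligibility traces
        s, a = 0, choose(0)
        for _ in range(200):                               # step cap
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == goal else 0.0
            a2 = choose(s2)
            delta = r + gamma * Q[s2][a2] - Q[s][a]        # TD error
            e[s][a] += 1.0                                 # accumulating trace
            for si in range(n_states):                     # update all traced pairs
                for ai in range(n_actions):
                    Q[si][ai] += alpha * delta * e[si][ai]
                    e[si][ai] *= gamma * lam               # decay by distance
            s, a = s2, a2
            if s == goal:
                break
    return Q

Q = sarsa_lambda()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(5)]
```

With λ = 0 the trace vanishes after one step and this reduces to plain SARSA; with λ close to 1 it behaves more like a Monte Carlo update, which is the trade-off the slide describes.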
To the learning system it does not matter whether the state-action pairs are real or simulated.

Dyna-Q
• Dyna-Q combines experience with planning to converge faster
• The idea is to learn from experience, but also to learn a model and use it to simulate and generate new experience

Dyna-Q (algorithm)
Initialize Q(s,a) arbitrarily
s ← current state; a ← є-greedy(s,Q)
Take action a; observe r and s'
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') – Q(s,a)]
Model(s,a) ← s',r
Repeat N times:
  s ← random previously visited state
  a ← random action previously taken in s
  s',r ← Model(s,a)
  Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') – Q(s,a)]

Prioritized Sweeping
• Planning can be more effective if it is focused on particular state-action pairs
• We can work backwards from goals, or from any state with important changes in its values
• We can prioritize the updates according to a relevance measure
• The idea is to keep a priority queue: evaluate the predecessors of the first element and add them to the queue if their change in value is greater than a certain threshold
• It works like Dyna-Q, but updates all the states whose Q-values change by more than a certain threshold

Function Approximation
• So far we have assumed that states (and state-action pairs) have an explicit tabular representation
• This works for small/discrete domains, but is impractical for large/continuous domains
• One possibility is to represent the value function (V or Q) implicitly
• A value function can be represented with a weighted linear function (e.g., in chess); evaluate the error and use gradient descent
• We do not know the true value function, but we can estimate it; we need an estimator, and one possibility is to use Monte Carlo.
• Alternatively, we can use n-step TD as an approximation of the target (although it does not guarantee convergence to a local optimum)

Function Approximation (cont.)
• With eligibility traces we can update the parameters as Θ ← Θ + α δ e_t, where δ is the TD error and e_t is a trace vector with one component per component of Θ
• We can use linear functions over different kinds of features:
  – Overlapping circles (coarse coding)
  – Partitions (tile coding), not necessarily uniform
  – Radial basis functions
  – Prototypes (Kanerva coding)
  – Locally weighted regression (LWR)
  – Gaussian processes

Regression
• In robotics, most information involves continuous variables
• There are different techniques for learning with continuous variables; some recent, commonly used ones are locally weighted regression (LWR) and Gaussian processes

LWR
• Model-based methods (e.g., neural networks, mixtures of Gaussians) use the data to find a parameterized model
• Memory-based methods store the data and use it each time a prediction is needed
• Locally weighted regression (LWR) is a memory-based method that fits a regression around the point of interest, weighting each stored example by its distance (e.g., with a Gaussian kernel)
• It is a fast method with good results
• It requires storing the examples and fitting a distance function (with a Gaussian kernel of small standard deviation we can remove some examples; otherwise, we need to keep them)

Gaussian Processes
• A Gaussian process is a collection of random variables with a joint Gaussian distribution
• It is specified by a mean and a covariance function
• The random variables represent the value of the function f(x) at point x
• The parameters of the covariance function are adjusted to the data

Relational Abstractions
• To deal with a large search space, use a state abstraction through a relational representation
• Easy to represent powerful abstractions
• Can incorporate domain knowledge
• Can re-use learned policies in other similar problems
• Example: King – Rook vs.
King (KRK): more than 150,000 states (positions) and 22 actions per state, with many equivalent positions and actions

Application to Robots
• (example environment: bed, chairs, window, stage)

Relational Representation for RL
• States are represented as sets of relations (r-states), e.g.: kings_in_opposition(State) and rook_divides(State) and not rook_threatened(State) …
• An r-state can cover a large number of states
• Actions are represented in terms of those relations (r-actions)
• Syntax:
  – Pre-conditions: set of relations
  – (g-)action (generalized action)
  – (Post-conditions: set of relations)

Example
r_action(State1, State2) ←
  rook_divs(State1) and
  opposition(State2) and
  move(rook, State1, State2) and
  not_threatened(rook, State2) and
  l_shaped_pattern(State2)

Learning Q(S,A)
• As in Q-learning, but the space is characterized by:
  – r-states (abstract states described by a set of state properties)
  – r-actions (that consider these properties)
• Choose A from S using, e.g., є-greedy
• Update Q-values over r-states and r-actions
• Can produce sub-optimal policies

Some experiments
• Faster convergence on 5×5 and 10×10 grids
• Re-use of policies

Programming by demonstration
• The idea is to show the robot a task and let it learn to perform the task based on the demonstration (examples)
• Approaches:
  – I guide the robot (also behavioural cloning)
  – I tell the robot
  – I show the robot (imitation)

Flight simulator
• High-fidelity flight model of a high-performance aircraft provided by DSTO
• 28 variables, most of them continuous
• Turbulence can be added as a random offset to the velocity components

rQ-learning for flying
• Assume the aircraft is in the air, with constant throttle, flat flaps and retracted gear
• Divide the task into two RL problems:
  1) Learn to move the stick left-right (aileron)
  2) Learn to move the stick forward-backward (elevation)
• Define relations for each task
• Use behavioural cloning to learn r-actions for each task
• Training mission: target1 … target4, with a human trace

Relations and actions
• Rels.
Elevation: dist_target, elevation
• Rels. Aileron: dist_target, orient_target, plane_rol, plane_rol_rate
• Acts. Aileron: far_left, left, nil, right, far_right
• Acts. Elevation: farfrwd, frwd, nil, back, farback

R-actions for Elevation
r_action(Id, el, State, move_stick(Move)) :-
  distance_target(State, Dist),
  elevation_target(State, Angle),
  move_stick(Move).
• There are 75 possible r-actions

R-actions for Aileron
r_action(Id, al, State, move_stick(Move)) :-
  distance_target(State, Dist),
  orientation_target(State, Angle1),
  plane_rol(State, Angle2),
  plane_rol_rate(State, Trend),
  move_stick(Move).
• There are 1,125 possible r-actions

Behavioural cloning
Two stages:
• Induce r-actions from flight traces (180 r-actions for aileron and 42 for elevation in 5 traces)
• Use the learned r-actions to explore and learn new r-actions until no more learning occurs (359 r-actions for aileron and 48 for elevation after 20 trials)

Exploration mode
• Randomly choose an r-action in each r-state
• If an unseen state appears, ask the user and learn a new r-action
• The user can also choose an alternative action at any time
• In total, an average of 1.6 r-actions per aileron state and 3.2 per elevation state

Experiments with RL
1. R-actions from BC + turbulence
2. As 1, without turbulence
3. As 1, with only the r-actions from the original traces
4. As 1, with ALL possible r-actions
5. As 4, with initial guidance (using the original traces to seed the initial Q-values)

Learning curves for aileron and elevation
1. BC+Expl+turb. 2. BC+Expl–turb. 3. BC–Expl+turb. 4. All r-actions 5. All+guidance

Discussion
• In all the experiments, after a certain initial (and fast) improvement, there is no further improvement
• Without behavioural cloning, RL is unable to learn even with guidance
• The exploration mode of BC also proved useful for completing the set of r-actions

Performance of RL with turbulence and on a new mission (clone vs. human)

Flight tests…
Discussion
• Learning with turbulence produces more robust controllers
• Behavioural cloning focuses the search on potentially relevant r-actions
• We can learn from different experts and let RL choose the best actions

Robot Navigation
• Use traces and transform low-level sensor information into a high-level representation
• Create high-level traces

Continuous action policy
• Learn a discrete action policy and then use LWR to produce a continuous action policy
• Discrete vs. continuous: the continuous policy is faster, smoother, closer to the expert and safer (convergence times, differences with the user's traces, execution times)

Inverse Reinforcement Learning
• To speed up convergence times, some people carefully define a reward function (reward shaping)
• Inverse reinforcement learning: the idea is to use human traces to derive a reward function
• It is assumed that the reward function is expressed as a linear combination of state features and that the traces provided by the user represent different policies
• The idea is to find a reward function that produces a policy similar to the user's (a combination of the user's policies)

Some examples (videos)
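The LWR technique used above to turn a discrete policy into a continuous one can be sketched as a minimal locally weighted linear regression in one dimension; the data, kernel width and function names are invented for illustration:

```python
import math

def lwr_predict(xq, xs, ys, tau=0.3):
    """Locally weighted linear regression at query point xq: weight each
    stored example with a Gaussian kernel of width tau, then fit and
    evaluate a weighted least-squares line y = b0 + b1*x."""
    w = [math.exp(-(x - xq) ** 2 / (2 * tau ** 2)) for x in xs]
    # weighted normal equations for the local line
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0 + b1 * xq

# memory-based: just store the data; every query fits its own local model
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [math.sin(x) for x in xs]
pred = lwr_predict(1.2, xs, ys)
```

This is exactly the memory-based flavour described earlier: nothing is learned at training time, and the Gaussian kernel width tau plays the role of the distance function that has to be fit to the data.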
Programming by demonstration
Steps:
• Observation: normally in a controlled environment
• Segmentation: the different movements are identified
• Interpretation: the movements are interpreted
• Abstraction: an abstracted sequence is generated
• Transformation: the sequence is transformed into the robot's dynamics
• Simulation: it is tested in a simulated environment
• Execution: it is tested in the real environment

Learning from Imitation
• To practice motor skills
• To program others
• To learn behaviour and social acceptance
• Problems: different perspectives, different bodies

Programming by demonstration
• The main problem is the correspondence problem (how to map user actions onto robot actions)
• Most people use sensors on the arm/hand
• This simplifies the problem, as there is no need for visual interpretation, but it does not eliminate it
• Examples of special devices (figures)

Visual Machine Learning
Steps:
• Segment an object from the images (not always)
• Extract a set of attributes (color, shape, texture – SIFT, Harris, SURF, RIFT, …)
• Learn a model with a classifier
• Use the classifier for people/face/object/place recognition

Face/skin detection and tracking
• AdaBoost on Haar attributes (Viola & Jones)
• Color – skin
• Tracking using a window around the object

Gesture Recognition (come, attention, right, left, stop)
• Feature extraction, modeling and recognition: Images → FE → Features → R → Gesture
• Find the face and hand, track the hand, recognize the gesture and execute the command

Human Tracking (video)
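The segment / extract-features / classify pipeline described above can be sketched with a toy nearest-centroid classifier over mean-color features; all data, class names and helper functions here are invented for illustration and are not the tutorial's actual system:

```python
def mean_color(pixels):
    """Feature extraction: average (R, G, B) over a segmented region."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def train_centroids(labeled_regions):
    """Learn one feature-space centroid per class from labeled regions."""
    sums, counts = {}, {}
    for label, pixels in labeled_regions:
        f = mean_color(pixels)
        s = sums.setdefault(label, [0.0, 0.0, 0.0])
        for i in range(3):
            s[i] += f[i]
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in s) for lab, s in sums.items()}

def classify(pixels, centroids):
    """Assign the class whose centroid is closest in feature space."""
    f = mean_color(pixels)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(f, centroids[lab])))

# invented toy data: "skin" regions are reddish, "background" bluish
train = [("skin", [(200, 140, 120), (210, 150, 130)]),
         ("background", [(40, 60, 180), (50, 70, 170)])]
centroids = train_centroids(train)
label = classify([(205, 145, 125)], centroids)  # → "skin"
```

In a real system the mean-color feature would be replaced by the richer descriptors listed above (SIFT, Harris, SURF, …) and the nearest-centroid rule by a stronger classifier such as AdaBoost, but the train-then-classify structure is the same.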
SIFT
• The Scale-Invariant Feature Transform (SIFT) transforms an image into a large collection of feature vectors, each of which is invariant to image translation, scaling and rotation, partially invariant to illumination changes, and robust to local geometric distortion
• It uses differences of Gaussians at different scales and stores information about these interest points for classification in new images

Object Recognition in Robots
• Show the robot an object, extract features and search for the object in a map (video)

People Recognition
• Feature extraction, localization and tracking, recognition, accumulation of evidence
• Problems with movement, posture, occlusion, illumination conditions, …

Visual Map
• Incorporate visual features into maps based on SIFT (or other features) and use them for localization
• Segmentation of the map: feature extraction, clustering, cluster centers, identification of adjacent regions, analysis of regions
• Add visual information to the nodes of the topological map (segmented map; localization and mapping example)

People Identification
• Silhouette-based recognition
• Distance-based segmentation using a stereo camera
• Identifies standing, sitting and sideways people

Semantic Robot Vision Challenge
• Given a list of objects, find them in an unknown environment
• The robots acquire data from the Internet about these objects and learn a classifier
• Then they search for the objects
• Requires filtering and ranking
• Other sources (such as LabelMe, the WalMart catalog, …)
• Concept learning
• Search strategies

Hierarchical Learning
• In many machine learning schemes, and in particular in learning tasks, it is natural to decompose tasks into subtasks
• One possibility is to learn a hierarchy of tasks
• There is some work in RL, but also in other ML techniques
• In general, the user selects the order in which to learn the concepts and/or the hierarchy of concepts
• Example (figure)
Active Learning
• In Machine Learning it is common for the user to provide the examples (supervised learning)
• In many domains it is easy to obtain examples, but difficult to label them
• In active learning the system automatically selects unlabeled examples and shows them to the user to obtain a class value
• An intelligent example selection produces better models
• Active learning can be used in robotics to create interesting examples and guide the learning process
• It requires user intervention, but the system is the one that drives the learning process

Learning new relations
• The definition of adequate relations and actions is not always an easy task
• Inadequate abstractions can miss relevant characteristics of the environment and produce sub-optimal policies
• An incomplete set of actions may prevent the agent from reaching a goal

When to learn a new relation?
• Given an action that works in most cases, recognize:
  – Unexpected rewards
  – Inapplicability

How to learn it?
• Gather a set of positive (+) and negative (–) examples
• Feed them to an ILP system

Refinement of the initial abstraction
• Identify unexpected instances and learn a new relation (new_rel)
• Add it negated (not new_rel) to the current r-action as an extra condition
• Create a copy of the old r-action with new_rel as an extra condition and ask the user for an adequate action
• Applied successfully to the KRK endgame from an initial set of relations and r-actions

rQ-learning with KRK
• After some refinements, a total of 26 relations and 27 r-actions were used (1,318 r-states and 2.67 r-actions per state vs.
150,000 states and 22 actions per state)
• After 5,000 training games, the learned policy needs 12.07 moves on average to reach check-mate over 100 random positions
• An improvement of 2.5 moves over a manually built strategy with the same actions over the same positions

Challenges
• Improve policies once learned
• Real-time learning
• Plan useful actions to accelerate learning
• Decompose human demonstrations
• Identify what is relevant in a demonstration
• Select an adequate representation
• Drive/initiate the learning process

Conclusions
• There is a large number of machine learning techniques that can (and should) be used for a wide variety of robot tasks
• Mobile robots normally pose new challenges to ML and computer vision
• There is an increasing interest in Robot Learning, as techniques are becoming more mature and feasible for different tasks

Thanks!
emorales@inaoep.mx