
Robot learning is currently a very active research area, as it promises to create more flexible, independent and autonomous robots. It also opens the possibility for non-experts to instruct a robot how to perform different tasks without the need for programming. In this short tutorial we review some of the main techniques currently used in robot learning and show some of their applications. In particular, we focus on reinforcement learning and programming by demonstration, but we also review some techniques used in visual concept learning and other machine learning techniques that can be applied in robotics.

Robot Learning

Eduardo Morales INAOE – Puebla emorales@inaoep.mx

Contents
• Motivation
• Most common tasks and ML techniques
• Conclusions and current challenges

Motivation
• There is an increasing use of service robots for different tasks (tendency towards humanoids)

What to expect?








• Service robots will become at least as important as industrial robots
• The service robot industry will become as important as the car industry
• Service robots will become as common as laptops are today
• There is huge investment in countries like Japan, Korea, the European Union and the USA

Expected growth

Why Robot Learning?

Why Robot Learning?
• In order to become widely accepted, robots need to be flexible, autonomous, and capable of adapting to their circumstances, and Machine Learning may be the only way to achieve this
• Machine Learning can also ease the burden of programming and allows robots to be instructed by non-experts

Challenges






• An adequate use and interpretation of visual, tactile, range, …, stimuli is necessary, together with an adequate selection of actions
• We need to estimate task parameters, adequately use primitives, recognize goals, …
• We also want learning to be fast (preferably online), smooth, able to recover from errors, able to deal with few samples and/or large amounts of data, and not too sensitive to errors in the demonstrations, …

Robot Learning


Robot learning can occur at different levels (from a direct mapping between perception and action up to the planning and learning of complex tasks) and with different ML strategies (supervised, semi-supervised, unsupervised, reinforcement)

Most Common ML Techniques
• Reinforcement learning with programming by demonstration (or behavioral cloning)
  • AAAI-2010 Learning by Demonstration Challenge
• Visual concept learning
  • AAAI-2010 Semantic Robot Vision Challenge
• Other useful techniques:
  • Active learning
  • Hierarchical learning
  • Different regression techniques
  • Clustering
  • …

Robot Tasks
• Navigation
• Interaction
• Recognition
• Grasping and manipulation
• Combined skills
• …

Reinforcement Learning


In reinforcement learning an autonomous agent follows a trial-and-error process to learn the optimal actions to perform in each state to reach its goals

Markov Decision Process
An MDP is described by:
• A finite set of states, S
• A finite set of actions per state, A
• A transition model, P(s' | s, a)
• A reward function for each state, r(s), or state-action pair, r(s, a)
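As an illustration, one simple way to encode such an MDP in code is with plain dictionaries (a sketch in Python; the state and action names below are made up, not part of the tutorial):

# A tiny MDP encoded with dictionaries (illustrative names only)
states = ["s0", "s1"]                                       # S
actions = {"s0": ["stay", "go"], "s1": ["stay"]}            # A(s)
P = {("s0", "stay"): [("s0", 1.0)],                         # P(s'|s,a) as (s', prob) pairs
     ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
     ("s1", "stay"): [("s1", 1.0)]}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0, ("s1", "stay"): 1.0}   # r(s,a)
gamma = 0.9                                                 # discount factor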

Example

Initial State

Transition Model






• There is uncertainty in the outcome of an action
• The transition model gives us the probability of reaching state s' given that we perform action a in state s: P(s'|s,a)
• The transition probabilities depend only on the current state (Markovian) and are stationary

Uncertainty in actions

Uncertainty in states

POMDP

Reward


Example – reward function
Grid-world example: each non-terminal state has a reward of -1/25; the two terminal states have rewards of +1 and -1.

Finite vs. Infinite Horizon


• Finite horizon: a finite number of steps
• Infinite horizon: an undetermined number of steps
• In many practical problems, and in most robotic tasks, an infinite horizon is assumed
• In this case the accumulated (discounted) reward is evaluated as:
R = r0 + γ r1 + γ² r2 + … = Σt γ^t rt,  with discount factor 0 ≤ γ < 1



Value Functions
• The value function (V or Q) is the expected value of the accumulated reward obtained when following a particular policy, under a finite horizon or an infinite horizon

Utilities
Utility values computed for each state of the grid-world example.

Utility


Considering a separable (additive) utility, it can be obtained from the current reward plus the (discounted) utility of the next state:
U(s0, s1, s2, …) = r(s0) + γ U(s1, s2, …)

Optimal Value Functions
• Bellman equations:
V*(s) = max_a [ r(s,a) + γ Σ_s' P(s'|s,a) V*(s') ]
Q*(s,a) = r(s,a) + γ Σ_s' P(s'|s,a) max_a' Q*(s',a')

Optimal Policy




Given a transition model, the objective is to find an optimal policy that maximizes the expected utility. The policy gives the best action to apply in each state:
π*(s) = argmax_a Σ_s' P(s'|s,a) V*(s')

Optimal Policy

Initial State

Solution Techniques
There are two types of basic algorithms:
• Dynamic programming techniques: assume a known model (transition and reward functions) and obtain an optimal policy by solving it (value iteration, policy iteration, linear programming)
• Monte Carlo and reinforcement learning: the model is not known and the solution is obtained by exploring the environment
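For the dynamic-programming case, a minimal value-iteration sketch in Python, assuming the dictionary encoding of the MDP shown earlier (states, actions, P, R are the same hypothetical structures):

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # Apply the Bellman optimality backup until the values stop changing.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                       for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Extract the greedy (optimal) policy from V
    policy = {s: max(actions[s], key=lambda a: R[(s, a)] +
                     gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
              for s in states}
    return V, policy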

Exploration vs. Exploitation
• When selecting actions, without knowing the model, there are two conflicting objectives:
  – Obtain immediate rewards from known good states (exploit)
  – Learn new state/action/reward relations by taking new actions that lead to new states (explore)
• Balance between immediate rewards and long-term rewards
• There is normally a gradual shift from exploration to exploitation

Action selection strategies
• є-greedy: most of the time selects the action that produces the greatest benefit, but with probability є selects a random action
• Softmax: changes the selection probabilities according to the estimated values, the most popular choice being the Boltzmann distribution
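Both strategies are easy to sketch in Python (q_values is assumed to be a dictionary mapping each action to its current value estimate):

import math, random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability є pick a random action, otherwise the greedy one.
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def softmax_boltzmann(q_values, temperature=1.0):
    # Boltzmann distribution: P(a) proportional to exp(Q(a)/T).
    actions = list(q_values)
    q_max = max(q_values.values())                    # subtract the max for numerical stability
    prefs = [math.exp((q_values[a] - q_max) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]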

Temporal Difference Methods
• TD(0) is the simplest, for value functions:
V(s) ← V(s) + α [r + γ V(s') - V(s)]

• SARSA, for Q values:
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]

• Q-learning, for Q values:
Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]

TD(0)
Initialize V(s) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step in the episode):
        a ← action given by π for s
        Take action a; observe reward r and next state s'
        V(s) ← V(s) + α [r + γ V(s') - V(s)]
        s ← s'
    Until s is terminal

SARSA
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Select a for s using the policy derived from Q (є-greedy)
    Repeat (for each step in the episode):
        Take action a; observe r and s'
        Select a' for s' using the policy derived from Q (є-greedy)
        Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]
        s ← s', a ← a'
    Until s is terminal

Q-Learning
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step in the episode):
        Select a for s using the policy derived from Q (є-greedy)
        Take action a; observe r and s'
        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
        s ← s'
    Until s is terminal
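A direct Python transcription of the Q-learning pseudocode, as a sketch; env.reset(), env.step(a) and env.actions(s) are assumed helper methods of a Gym-like environment, not part of the tutorial:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                            # Q(s,a), initialised arbitrarily (to 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # є-greedy action selection from Q
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q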

Now suppose we want to learn how to fly ... from our interactions with the environment ...

...

• Challenges: a large number of continuous variables, an “infinite” space and large areas without a clear reward

RL Problems
• Long training times
• Continuous state and action spaces
• Non-transferable policies

Some proposed solutions
• Update several states at a time, learn and use a model, state/action abstractions, hierarchies, reward shaping, function approximation, provide human traces, …

Eligibility Traces
• Eligibility traces are between Monte Carlo and TD, as they consider several posterior rewards (or affect several previous states)
• Simple TD uses the one-step target:
r_t+1 + γ V(s_t+1)
• We can easily consider two (or more) known rewards:
r_t+1 + γ r_t+2 + γ² V(s_t+2)

Eligibility Traces
• In practice, instead of waiting n steps to update (forward view), a list of visited states is stored and their values are updated (backward view), discounted by distance
• For TD(λ):
e_t(s) = γ λ e_t-1(s)  (+1 if s = s_t);   V(s) ← V(s) + α δ_t e_t(s)
• For SARSA(λ):
e_t(s,a) = γ λ e_t-1(s,a)  (+1 if (s,a) = (s_t,a_t));   Q(s,a) ← Q(s,a) + α δ_t e_t(s,a)

SARSA(λ)
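A minimal tabular SARSA(λ) sketch with accumulating traces (same assumed Gym-like env interface as before):

import random
from collections import defaultdict

def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.9, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    def pick(state):
        # є-greedy selection from the current Q estimates
        if random.random() < epsilon:
            return random.choice(env.actions(state))
        return max(env.actions(state), key=lambda b: Q[(state, b)])
    for _ in range(episodes):
        e = defaultdict(float)                        # eligibility traces e(s,a)
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            if done:
                delta, a2 = r - Q[(s, a)], None
            else:
                a2 = pick(s2)
                delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
            e[(s, a)] += 1.0                          # accumulating trace
            for key in list(e):
                Q[key] += alpha * delta * e[key]      # every visited pair is updated
                e[key] *= gamma * lam                 # and its trace decays
            s, a = s2, a2
    return Q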

Methods with/without models


• Model-free methods learn a policy without trying to estimate the expected rewards and the transition probabilities (e.g., Q-learning)
• Model-based methods try to learn an explicit model and derive a policy from it (they converge faster but require more memory)



Planning and Learning




 

• Given a model, it can be used to predict the next state and its associated reward
• The prediction can be a set of states with their rewards and probabilities of occurrence
• With the model we can plan and learn!
• To the learning system it doesn't matter if the state-action pairs are real or simulated

Dyna-Q




• Dyna-Q combines real experience with planning to converge faster
• The idea is to learn from experience, but also to learn and use a model to simulate and generate new experience

Dyna-Q
Initialize Q(s,a) arbitrarily
Repeat forever:
    s ← current state, a ← є-greedy(s, Q)
    Take action a; observe r and s'
    Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
    Model(s,a) ← s', r
    Repeat N times:
        s ← random previously visited state
        a ← random action previously taken in s
        s', r ← Model(s,a)
        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
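The same algorithm as a Python sketch (env interface assumed as before; N simulated updates are performed per real step):

import random
from collections import defaultdict

def dyna_q(env, steps=5000, n_planning=20, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                        # Model(s,a) -> (s', r, done)
    def pick(state):
        if random.random() < epsilon:
            return random.choice(env.actions(state))
        return max(env.actions(state), key=lambda b: Q[(state, b)])
    def backup(s_, a_, r_, s2_, done_):
        target = r_ if done_ else r_ + gamma * max(Q[(s2_, b)] for b in env.actions(s2_))
        Q[(s_, a_)] += alpha * (target - Q[(s_, a_)])
    s = env.reset()
    for _ in range(steps):
        a = pick(s)
        s2, r, done = env.step(a)                     # real experience
        backup(s, a, r, s2, done)
        model[(s, a)] = (s2, r, done)                 # learn the model
        for _ in range(n_planning):                   # planning with simulated experience
            (ps, pa), (ps2, pr, pdone) = random.choice(list(model.items()))
            backup(ps, pa, pr, ps2, pdone)
        s = env.reset() if done else s2
    return Q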

Dyna-Q

Prioritized Sweeping






• Planning can be more effective if it is focused on particular state-action pairs
• We can work backwards from goals, or from any state with important changes in its values
• We can prioritize the updates according to a relevance measure

Prioritized Sweeping




• The idea is to keep a queue, evaluate the predecessors of the first element, and add them to the queue if their change in value is greater than a certain threshold
• It works like Dyna-Q, but updates all the states whose Q-values change by more than a certain threshold value
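A sketch of the priority-queue bookkeeping behind prioritized sweeping; model, actions and predecessors are assumed helpers (predecessors(s) returns the known (s, a, r) transitions that lead into s), and terminal-state handling is omitted:

import heapq
from itertools import count

def prioritized_sweeping(Q, model, actions, predecessors,
                         alpha=0.1, gamma=0.95, theta=1e-3, n_updates=50):
    tie = count()                                     # tie-breaker so the heap never compares states
    queue = []
    # Seed the queue with every recorded transition whose backup would change Q noticeably.
    for (s, a), (s2, r) in model.items():
        p = abs(r + gamma * max((Q[(s2, b)] for b in actions(s2)), default=0.0) - Q[(s, a)])
        if p > theta:
            heapq.heappush(queue, (-p, next(tie), (s, a)))   # max-heap via negated priority
    for _ in range(n_updates):
        if not queue:
            break
        _, _, (s, a) = heapq.heappop(queue)
        s2, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max((Q[(s2, b)] for b in actions(s2)), default=0.0)
                              - Q[(s, a)])
        # Propagate backwards: predecessors of s may now need updating too.
        for (ps, pa, pr) in predecessors(s):
            p = abs(pr + gamma * max((Q[(s, b)] for b in actions(s)), default=0.0) - Q[(ps, pa)])
            if p > theta:
                heapq.heappush(queue, (-p, next(tie), (ps, pa)))
    return Q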

Function Approximation






• So far, we have assumed that states (and state-action pairs) have an explicit tabular representation
• This works for small/discrete domains, but is impractical for large/continuous domains
• One possibility is to represent the value function (V or Q) implicitly

Function Approximation


• A value function can be represented with a weighted linear function (e.g., in chess):
V_θ(s) = θᵀ φ(s)
• Evaluate the error:
E = ½ Σ_s [V^π(s) - V_θ(s)]²
• Use gradient descent:
θ ← θ + α [V^π(s) - V_θ(s)] φ(s)
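A sketch of the corresponding semi-gradient TD(0) update for a linear value function; phi(s) is an assumed feature-extraction function returning a NumPy vector:

import numpy as np

def v_linear(theta, phi_s):
    # V_theta(s) = theta^T phi(s)
    return float(np.dot(theta, phi_s))

def td0_linear_update(theta, phi_s, r, phi_s2, done, alpha=0.01, gamma=0.99):
    # Semi-gradient TD(0): the gradient of V_theta(s) w.r.t. theta is simply phi(s).
    target = r if done else r + gamma * np.dot(theta, phi_s2)
    delta = target - np.dot(theta, phi_s)             # TD error
    return theta + alpha * delta * phi_s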

Function Approximation
• We do not know V^π(s), but we can estimate it
• We need an estimator, and one possibility is to use Monte Carlo
• Alternatively, we can use an n-step TD return as an approximation of V^π(s) (although it does not guarantee convergence to a local optimum)

Function Approximation


With eligibility traces we can approximate the parameters as follows:
θ_t+1 = θ_t + α δ_t e_t
where δ_t is the TD error:
δ_t = r_t+1 + γ V_θ(s_t+1) - V_θ(s_t)
and e_t is a trace vector, with one component per component of θ, updated as:
e_t = γ λ e_t-1 + ∇_θ V_θ(s_t)





Function Approximation
• We can use linear functions
• The attributes can be overlapping circles (coarse coding)
• Partitions (tile coding), not necessarily uniform
• Radial basis functions
• Prototypes (Kanerva coding)
• Locally weighted regression (LWR)
• Gaussian processes
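As an illustration of one of these constructions, a minimal tile-coding sketch for a bounded one-dimensional state (real implementations usually hash multi-dimensional tiles; the parameters below are arbitrary):

import numpy as np

def tile_code(x, low=0.0, high=1.0, n_tiles=10, n_tilings=4):
    # Binary feature vector with exactly one active tile per (offset) tiling.
    features = np.zeros(n_tiles * n_tilings)
    width = (high - low) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings                # each tiling is slightly shifted
        idx = int((x - low + offset) / width)
        idx = min(max(idx, 0), n_tiles - 1)
        features[t * n_tiles + idx] = 1.0
    return features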

Regressions






• In robotics most information involves continuous variables
• There are different techniques for learning with continuous variables
• Some recent, commonly used techniques are:
  • Locally weighted regression (LWR)
  • Gaussian processes

LWR






• Model-based methods (e.g., neural networks, mixtures of Gaussians) use the data to find a parameterized model
• Memory-based methods store the data and use it each time a prediction is needed
• Locally weighted regression (LWR) is a memory-based method that fits a local regression around the query point, weighted by distance (e.g., with a Gaussian kernel)

LWR
• It is a fast method with good results
• It requires storing the examples and fitting a distance function (with a Gaussian kernel and a small standard deviation some examples can be discarded; otherwise, we need to keep them all)
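A compact LWR sketch in Python with a Gaussian kernel (X and y are the stored examples and targets; the bandwidth is the kernel's standard deviation):

import numpy as np

def lwr_predict(x_query, X, y, bandwidth=1.0):
    # Weight each stored example by its (Gaussian-kernel) distance to the query point.
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # add a bias column
    W = np.diag(w)
    # Weighted least squares: beta = (Xb^T W Xb)^+ Xb^T W y
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return float(np.append(x_query, 1.0) @ beta)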

Gaussian Processes






• A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution
• It is specified by a mean function and a covariance function
• The random variables represent the value of the function f(x) at point x
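A minimal GP-regression sketch with an RBF covariance function (a from-scratch NumPy illustration, not a specific library API):

import numpy as np

def rbf_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    # Squared-exponential covariance between every pair of rows of A and B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4, **kernel_args):
    # Mean and covariance of f(X_test) conditioned on the training data.
    K = rbf_kernel(X_train, X_train, **kernel_args) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, **kernel_args)
    K_ss = rbf_kernel(X_test, X_test, **kernel_args)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, cov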

Graphically, a GP
Samples drawn without information (from the prior) vs. with data (from the posterior).

Adjustment of Parameters
• The parameters of the covariance function (hyperparameters) are adjusted to fit the data

Relational Abstractions
To deal with a large search space, use a state abstraction through a relational representation:
• Easy to represent powerful abstractions
• Can incorporate domain knowledge
• Can re-use learned policies on other similar problems

King – Rook vs. King
> 150,000 states (positions) and 22 actions per state

Equivalent positions - actions

Application to Robots

Map of an environment with labeled regions: bed, chairs, window, stage.

Relational Representation for RL
• States are represented as sets of relations (r-states):
kings_in_opposition(State) and rook_divides(State) and not rook_threatened(State) ...
• An r-state can cover a large number of states

Relational Representation (2)
• Actions are represented in terms of those relations (r-actions)
• Syntax:
  – pre-conditions: set of relations
  – (g-)action (generalized action)
  – (post-conditions: set of relations)

Example
r_action(State1, State2) ←
    rook_divs(State1) and
    opposition(State2) and
    move(rook, State1, State2) and
    not_threatened(rook, State2) and
    l_shaped_pattern(State2)

Learning of Q(S,A)
As in Q-learning, but the space is characterized by:
• r-states (abstract states described by a set of properties of states)
• r-actions (that consider these properties)
• Choose A from S using, e.g., є-greedy
• Updates Q-values over r-states and r-actions
• Can produce sub-optimal policies

Some experiments

Faster convergence

5 x 5 grid

10 x 10 grid

Re-use of policies

Program by demonstration
The idea is to show the robot a task and let the robot learn to do it based on the demonstration (example).
Approaches:
• I guide the robot (also behavioural cloning)
• I tell the robot
• I show the robot (imitation)


Flight simulator
• High-fidelity flight model of a high-performance aircraft provided by DSTO
• 28 variables, most of them continuous
• Turbulence can be added as a random offset to the velocity components

rQ-learning for flying
• Assume the aircraft is in the air, with constant throttle, flat flaps and retracted gear
• Divide the task into two RL problems:
  1) Learn to move the stick left-right (aileron)
  2) Learn to move the stick forward-backward (elevation)
• Define relations for each task
• Use behavioural cloning to learn r-actions for each task

Training mission
Fly through a sequence of targets (target1 … target4).

Human trace

Relations and actions
• Rels. elevation: dist_target, elevation
• Rels. aileron: dist_target, orient_target, plane_rol, plane_rol_rate
• Acts. aileron: far_left, left, nil, right, far_right
• Acts. elevation: farfrwd, frwd, nil, back, farback

R-actions for Elevation
r_action(Id, el, State, move_stick(Move)) :-
    distance_target(State, Dist),
    elevation_target(State, Angle),
    move_stick(Move).
• There are 75 possible r-actions

R-actions for Aileron
r_action(Id, al, State, move_stick(Move)) :-
    distance_target(State, Dist),
    orientation_target(State, Angle1),
    plane_rol(State, Angle2),
    plane_rol_rate(State, Trend),
    move_stick(Move).
• There are 1,125 possible r-actions

Behavioural cloning
Two stages:
• Induce r-actions from traces of flights (180 r-actions for aileron and 42 for elevation from 5 traces)
• Use the learned r-actions to explore and learn new r-actions until no more learning occurs (359 r-actions for aileron and 48 for elevation after 20 trials)

Exploration mode
• Randomly choose an r-action in each r-state
• If an unseen state is reached, ask the user and learn a new r-action
• The user can also choose an alternative action at any time
• In total, an average of 1.6 r-actions per aileron state and 3.2 per elevation state

Experiments with RL
1. R-actions from BC + turbulence
2. 1 without turbulence
3. 1 with only the r-actions from the original traces
4. 1 with ALL possible r-actions
5. 4 with initial guidance (using the original traces to seed initial Q values)

Learning curves for aileron and for elevation under the five settings:
1. BC+Expl+turb.  2. BC+Expl-turb.  3. BC-Expl+turb.  4. All r-actions  5. All+guidance

Discussion
• In all the experiments, after a certain initial (and fast) improvement, there is no further improvement
• Without behavioural cloning, RL is unable to learn even with guidance
• The exploration mode of BC also proved useful for completing the set of r-actions

Performance of RL with turbulence: clone vs. human

Performance on a new mission: human vs. clone

Flight tests….

Discussion
• Learning with turbulence produces more robust controllers
• Behavioural cloning focuses the search on potentially relevant r-actions
• We can learn from different experts and let RL choose the best actions

Robot Navigation

Use traces and transform low-level sensor information into a high-level representation

Create high-level traces

Continuous action policy
• Learn a discrete action policy and then use LWR to produce a continuous action policy

Discrete vs. Continuous
• Faster, smoother, closer to the expert, and safer

Convergence times

Differences with user’s traces

Execution times

Safer

Inverse Reinforcement Learning








• To speed up convergence, some people carefully define a reward function (reward shaping)
• Inverse reinforcement learning: the idea is to use human traces to derive a reward function
• It is assumed that the reward function can be expressed as a linear combination of state features and that the traces provided by the user represent different policies
• The idea is to find a reward function that produces a policy similar to the user's (a combination of the user's policies)

Some examples

Program by demonstration
Steps:
• Observation: normally in a controlled environment
• Segmentation: the different movements are identified
• Interpretation: the movements are interpreted
• Abstraction: an abstracted sequence is generated
• Transformation: this sequence is transformed into the robot's dynamics
• Simulation: it is tested in a simulated environment
• Execution: it is tested in the real environment

Learning from Imitation
• Practice motor skills

Learning from Imitation
• To program others

Learning from Imitation
• Learning behavior and social acceptance

Problems
• Different perspectives
• Different bodies

Program by demonstration


 

• The main problem is the correspondence problem (how to map user actions onto robot actions)
• Most people use sensors on the arm/hand
• This simplifies the problem, as there is no need for visual interpretation, but it does not eliminate it

Program by demonstration


Examples of special devices

Examples

Examples

Examples

Visual Machine Learning
Steps:
• Segment an object from the images (not always)
• Extract a set of attributes (color, shape, texture – SIFT, Harris, SURF, RIFT, …)
• Learn a model with a classifier
Use the classifier for:
• People/face/object/place recognition

Process and Segmentation

Face/skin detection and tracking
• AdaBoost on Haar attributes (Viola & Jones)
• Color – skin
• Tracking using a window around the object
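A minimal face-detection sketch using OpenCV's pre-trained Haar cascade (Viola & Jones style boosted classifier); the frame is assumed to be a BGR image from the robot's camera:

import cv2

# Pre-trained frontal-face Haar cascade shipped with opencv-python
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Returns (x, y, w, h) rectangles; a tracking window can be placed around each one
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                         minSize=(30, 30))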

Gesture commands: Come, Attention, Right, Left, Stop

Gesture Recognition
• Feature extraction
• Modeling
• Recognition
Pipeline: Images → Feature Extraction (FE) → Features → Model (M) / Recognition (R) → Gesture

Find face and hand

Tracking and execution of commands
Hand tracking

Gesture recognition

Human Tracking


SIFT
The Scale-Invariant Feature Transform (SIFT) transforms an image into a large collection of feature vectors, each of which is invariant to image translation, scaling, and rotation, partially invariant to illumination changes, and robust to local geometric distortion

SIFT
• Uses differences of Gaussians at different scales
• Stores information about these points for classification in new images
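A short sketch of extracting and matching SIFT features with OpenCV (cv2.SIFT_create requires OpenCV 4.4 or later; the image paths are placeholders):

import cv2

def sift_match(path1, path2, ratio=0.75):
    sift = cv2.SIFT_create()
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)
    kp1, des1 = sift.detectAndCompute(img1, None)     # keypoints and 128-D descriptors
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = []                                         # Lowe's ratio test keeps reliable matches
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return kp1, kp2, good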

Object Recognition in Robots
• Show the robot an object, extract features and search for the object in a map


People Recognition

Pipeline: feature extraction → localization and tracking → recognition → accumulate evidence → results

People Recognition

Example in the robot


Problems with movement, posture, occlusion, illumination conditions, …

Visual Map
• Incorporate visual features into maps, based on SIFT (or other features), and use them for localization

Clustering
Pipeline: segmentation of the map → feature extraction → clustering → cluster centers → identified adjacent regions → analysis of regions

Add Visual information on nodes

Topological Map

Segmented Map

Localization and Mapping

Example

People Identification
• Silhouette-based recognition
• Distance-based segmentation using a stereo camera
• Identifies standing, sitting and sideways people

Semantic robot vision challenge
• Given a list of objects, find them in an unknown environment
• The robots acquire data about these objects from the Internet and learn a classifier
• Then they search for the objects

Semantic robot vision challenge
• Requires filtering and ranking
• Other sources (such as LabelMe, the WalMart catalog, …)
• Concept learning
• Searching strategies

Hierarchical Learning
• In many machine learning schemes, and in particular when learning tasks, it is natural to decompose the tasks into subtasks
• One possibility is to learn a hierarchy of tasks
• There is some work on this in RL, but also in other ML techniques
• In general, the user selects the order in which to learn the concepts and/or the hierarchy of concepts

Example

Active Learning
• In Machine Learning it is common for the user to provide the examples (supervised learning)
• In many domains it is easy to obtain examples, but difficult to classify them
• In active learning the system automatically selects unlabeled examples and shows them to the user to obtain a class value
• An intelligent example selection produces better models
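A minimal uncertainty-sampling sketch, one common way to pick which examples to show the user (the classifier choice and the labeled/unlabeled pools below are assumptions, using scikit-learn's LogisticRegression):

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_labeled, y_labeled, X_unlabeled, n_queries=5):
    # Train on the labeled pool, then pick the unlabeled examples the model is least sure about.
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    confidence = clf.predict_proba(X_unlabeled).max(axis=1)
    query_idx = np.argsort(confidence)[:n_queries]    # lowest confidence = most informative
    return query_idx                                  # show these examples to the user for labeling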

Active Learning




• It can be used in robotics to create interesting examples and guide the learning process
• It requires user intervention, but the system is the one that drives the learning process

Learning new relations
• The definition of adequate relations and actions is not always an easy task
• Inadequate abstractions can miss relevant characteristics of the environment and produce sub-optimal policies
• An incomplete set of actions may prevent the agent from reaching a goal

When to learn a new relation?
• Given an action which works in most cases, e.g.:
• Recognize:
  • Unexpected rewards
  • Inapplicability

How to learn it?
• Gather a set of positive (+) and negative (-) examples
• Feed them to an ILP system

Refinement of the initial abstraction
• Identify unexpected instances and learn a new relation (new_rel)
• Add it negated (not new_rel) to the current r-action as an extra condition
• Create a copy of the old r-action with new_rel as an extra condition and ask the user for an adequate action
• Applied successfully in the KRK endgame, starting from an initial set of relations and r-actions

rQ-learning with KRK
• After some refinements, a total of 26 relations and 27 r-actions were used (1,318 r-states and 2.67 r-actions per state vs. 150,000 states and 22 actions per state)
• After 5,000 training games, the learned policy uses 12.07 moves on average to checkmate over 100 random positions
• An improvement of 2.5 moves over a manually built strategy with the same actions over the same positions

KRK

Challenges
• Improve policies once learned
• Real-time learning
• Plan useful actions to accelerate learning
• Decompose human demonstrations
• Identify what is relevant from the demonstration
• Select an adequate representation
• Drive/initiate the learning process

Conclusions
• There is a large number of machine learning techniques that can (and should) be used for a wide variety of robot tasks
• Mobile robots normally pose new challenges to ML and CV
• There is an increasing interest in Robot Learning as the techniques are becoming more mature and feasible for different tasks

Thanks!
emorales@inaoep.mx


								