# Markov Decision Processes A to Z

Markov Decision Processes: A Survey
Martin L. Puterman
Outline
•   Example - Airline Meal Planning
•   MDP Overview and Applications
•   Airline Meal Planning Models and Results
•   MDP Theory and Computation
•   Bayesian MDPs and Censored Models
•   Reinforcement Learning
•   Concluding Remarks

Airline Meal Planning
• Goal: Get the right number of meals on
each flight
• Why is this hard?
• Why is this important to an airline?
– 500 flights per day × 365 days × \$5/meal = \$912,500

[Figure: adjustment amount by time period (Diff36_6, Diff6_3, Diff3_2, Diff2_1, Diff1_0); vertical axis "Amount" from -40.0 to 60.0.]
How Significant is the Problem?
[Figure: histogram of provisioning error (meals); frequency over errors ranging from -20 to 30.]
The Meal Planning Decision Process
• At several key decision points up to 3 hours before departure, the meal planner observes reservations and meals allocated, and adjusts the allocated meal quantity.
• In the last three hours, adjustments are made hourly; they are significantly more costly and are limited by van delivery capacity.

Meal Planning Timeline

– 18 hours before departure: schedule, production
– 6 hours: order assembly
– 3 hours: order ready to go
– Final hours: delivery of order with van
– Departure: observe passengers
Airline Meal Planning
Operational goal: develop a meal planning strategy that minimizes expected total overage, underage, and operational costs.
A Meal Planning Strategy specifies
at each decision point the number of extra
meals to prepare or deliver for any
observed meal allocation and
reservation quantity.

Why is Finding an Optimal Meal
Planning Strategy Challenging?
• 6 decision points
• 108 passengers
• 108 possible actions
• One strategy requires 108 × 108 × 6 = 69,984 order quantities.
• There are 7,558,272 strategies to consider.
• Demand must be forecasted.

Airline Meal Planning Characteristics
• A similar decision is made at several time
points
• There are costs associated with each
decision
• The decision has future consequences
• The overall cost depends on several events
• There is uncertainty about the future

What is a Markov decision process?
• A mathematical representation of a sequential
decision making problem in which:
– A system evolves through time.
– A decision maker controls it by taking actions
at pre-specified points of time.
– Actions incur immediate costs or accrue
immediate rewards and affect the subsequent
system state.

MDP Overview

Markov Decision Processes
are also known as:
•   MDPs
•   Dynamic Programs
•   Stochastic Dynamic Programs
•   Sequential Decision Processes
•   Stochastic Control Problems

Early Historical Perspective
•   Massé - Reservoir Control (1940s)
•   Wald - Sequential Analysis (1940s)
•   Bellman - Dynamic Programming (1950s)
•   Arrow, Dvoretzky, Wolfowitz, Kiefer, Karlin - Inventory (1950s)
•   Howard (1960) - Finite State and Action Models
•   Blackwell (1962) - Theoretical Foundation
•   Derman, Ross, Denardo, Veinott (1960s) - Theory - USA
•   Dynkin, Krylov, Shiryaev, Yushkevich (1960s) - Theory - USSR

Basic Model Ingredients
• Decision epochs – {0, 1, 2, …, N} or [0, N] or {0, 1, 2, …} or [0, ∞)
• State Space – S (generic state s)
• Action Sets – A_s (generic action a)
• Rewards – r_t(s, a)
• Transition probabilities – p_t(j | s, a)
A model is called stationary if rewards and transition probabilities are independent of t.

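For concreteness, here is a minimal sketch of how these ingredients might be encoded for a small finite, stationary model; the two-state system, action names, and numbers are purely illustrative, not from the talk.

```python
# Illustrative encoding of the ingredients of a finite, stationary MDP.
# All states, actions, rewards, and probabilities are made up for this sketch.

states = ["low", "high"]                  # state space S
actions = {"low": ["wait", "repair"],     # action sets A_s
           "high": ["wait"]}

# rewards r(s, a)
reward = {("low", "wait"): 0.0,
          ("low", "repair"): -5.0,
          ("high", "wait"): 10.0}

# transition probabilities p(j | s, a); each distribution sums to 1
transition = {("low", "wait"):   {"low": 0.9, "high": 0.1},
              ("low", "repair"): {"low": 0.1, "high": 0.9},
              ("high", "wait"):  {"low": 0.3, "high": 0.7}}
```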
System Evolution
[Diagram: at decision epoch t the system occupies state s_t, action a_t is chosen, and reward r_t(s_t, a_t) accrues; at decision epoch t+1 the system occupies s_{t+1}, action a_{t+1} is chosen, and reward r_{t+1}(s_{t+1}, a_{t+1}) accrues.]
Another Perspective
[Diagram: decision-tree view; from the current state, actions a_1 and a_2 lead stochastically to successor states s_1, s_2, s_3, s_4.]
Yet Another Perspective:
An Event Timeline

– May 15: place June order
– June 1: May order arrives; ship product to DCs
– June 10: May sales data arrives; prepare July forecast
– June 15: place July order
– …
Some Variants on the Basic Model
• There may be a continuum of states and/or actions
• Decisions may be made in continuous time
• Rewards and transition rates may change over
time
• System state may not be observable
• Some model parameters may not be known

Derived Quantities
• Decision Rules: d_t(s)
• Policies, Strategies, or Plans: π = (d_1, d_2, …) or π = (d_1, d_2, …, d_N)
• Stochastic Processes: (X_t, Y_t), E_s^π{ · }
• Value Functions: v_t(s), v(s), g, …
– Value functions differ from immediate rewards; they represent the value, starting in a state, of all future events.

Objective
Identify a policy that maximizes either the
– expected total reward (finite or infinite horizon)

$$v^\pi(s) = E_s^\pi\Big\{ \sum_{t=0}^{N} r_t(X_t, Y_t) \Big\} \quad (N = \infty \text{ in the infinite-horizon case})$$
– expected discounted reward
– expected long run average reward
– expected utility
possibly subject to constraints on system
performance

The Bellman Equation
MDP computation and theory focus on solving the optimality (Bellman) equation, which for infinite-horizon discounted models is

$$v(s) = \max_{a \in A_s} \Big\{ r(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, v(j) \Big\}$$

This can also be expressed as v = Tv or Bv = 0.
v(s) is the value function of the MDP.

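As a concrete illustration (not part of the talk), here is a minimal value-iteration sketch that solves the discounted Bellman equation above, reusing the illustrative two-state model encoded earlier; `lam` plays the role of the discount factor.

```python
# Value iteration: iterate v <- Tv until (approximate) convergence to the
# fixed point of the Bellman equation. Assumes the illustrative model
# (states, actions, reward, transition) defined in the earlier sketch.

lam = 0.9   # discount factor (illustrative)

v = {s: 0.0 for s in states}
for _ in range(10000):
    v_new = {s: max(reward[s, a]
                    + lam * sum(transition[s, a][j] * v[j] for j in states)
                    for a in actions[s])
             for s in states}
    if max(abs(v_new[s] - v[s]) for s in states) < 1e-10:
        v = v_new
        break
    v = v_new

print(v)   # approximate solution of v = Tv
```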
Some Theoretical Issues
• When does an optimal policy with nice structure
exist?
– Markov or Stationary Policy
– (s,S) or Control Limit Policy
• When do computational algorithms converge, and how fast?
• What properties do solutions of the optimality
equation have?

Computing Optimal Policies
• Why?
– Implementation
– Gold Standard for Heuristics
• Basic Principle - Transform multi-period problem
into a sequence of one-period problems.
• Why is computation difficult in practice?
– Curse of Dimensionality

Computational Methods
• Finite Horizon Models
– Backward Induction (Dynamic Programming)
• Infinite Horizon Models
–   Value Iteration
–   Policy Iteration
–   Modified Policy Iteration
–   Linear Programming
–   Neuro-Dynamic Programming/Reinforcement learning

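To make the finite-horizon method concrete, here is a hedged sketch of backward induction, again reusing the illustrative stationary model from the earlier sketches; the horizon N and the zero terminal values are assumptions of the sketch.

```python
# Backward induction: solve one one-period problem per decision epoch,
# working backward from the horizon N. Uses the illustrative model
# (states, actions, reward, transition) defined earlier.

def backward_induction(states, actions, reward, transition, N):
    v = {s: 0.0 for s in states}            # terminal values v_N(s) = 0 (assumed)
    policy = []
    for t in reversed(range(N)):            # t = N-1, ..., 0
        q = {s: {a: reward[s, a]
                 + sum(transition[s, a][j] * v[j] for j in states)
                 for a in actions[s]}
             for s in states}
        d_t = {s: max(q[s], key=q[s].get) for s in states}  # greedy decision rule
        v = {s: q[s][d_t[s]] for s in states}               # v_t(s)
        policy.insert(0, d_t)
    return v, policy                        # v_0 and (d_0, ..., d_{N-1})

v0, pi = backward_induction(states, actions, reward, transition, N=5)
```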
Infinite Horizon Computation
• Iterative algorithms work as follows:
– Approximate the value function by $\tilde v(s)$
– Select a new decision rule by

$$d(s) \in \arg\max_{a \in A_s} \Big\{ r(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, \tilde v(j) \Big\}$$

– Re-approximate the value function
• Approximation methods
– exact - policy iteration
– iterative - value iteration and modified policy iteration
– simulation based - reinforcement learning

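The exact variant of this iterate is policy iteration. The sketch below, under the same illustrative model and discount factor as before, alternates exact evaluation of the current decision rule with the greedy rule selection displayed above.

```python
# Policy iteration: exact evaluation of the current decision rule d
# (solve (I - lam * P_d) v = r_d), then greedy improvement.
import numpy as np

def policy_iteration(states, actions, reward, transition, lam=0.9):
    d = {s: actions[s][0] for s in states}        # arbitrary initial rule
    idx = {s: i for i, s in enumerate(states)}
    while True:
        P = np.array([[transition[s, d[s]][j] for j in states] for s in states])
        r = np.array([reward[s, d[s]] for s in states])
        v = np.linalg.solve(np.eye(len(states)) - lam * P, r)  # exact v_d
        d_new = {s: max(actions[s],
                        key=lambda a, s=s: reward[s, a]
                        + lam * sum(transition[s, a][j] * v[idx[j]]
                                    for j in states))
                 for s in states}
        if d_new == d:
            return d, v          # greedy rule repeats itself: it is optimal
        d = d_new
```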
Applications (A to N)
• Airline Meal Planning
• Behavioural Ecology
• Capacity Expansion
• Decision Analysis
• Equipment Replacement
• Fisheries Management
• Gambling Systems
• Highway Pavement Repair
• Inventory Control
• Job Seeking Strategies
• Knapsack Problems
• Learning
• Medical Treatment
• Network Control

Applications (O to Z)
• Option Pricing
• Project Selection
• Queueing System Control
• Robotic Motion
• Scheduling
• Tetris
• User Modeling
• Vision (Computer)
• Water Resources
• X-Ray Dosage
• Yield Management
• Zebra Hunting

“Coffee, Tea or …? A Markov Decision
Process Model for Airline Meal Provisioning”
J. Goto, M.E. Lewis and MLP

• Decision Epochs: T = {1, …, 5}
– 0: departure time
– 1–3: 1, 2, and 3 hours pre-departure
– 4: 6 hours pre-departure
– 5: 36 hours pre-departure
• States: {(l, q): 0 ≤ l ≤ Booking Limit, 0 ≤ q ≤ Capacity}
• Actions (meal quantity after delivery):
– A_{t,(l,q)} = {0, 1, …, Plane Capacity} for t = 3, 4, 5
– A_{t,(l,q)} = {q − van capacity, …, q + van capacity} for t = 1, 2

Markov Decision Process Formulation

Costs (depending on t):
r_t((l, q), a) = meal cost + return penalty + late delivery charge + shortage cost

Transition probabilities:

$$p_t((l', q') \mid (l, q), a) = \begin{cases} p_t(l' \mid l) & a = q' \\ 0 & a \neq q' \end{cases}$$

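A small sketch of how this factored transition structure might be coded: the action fixes the next meal quantity deterministically, while reservations evolve stochastically. The function `res_transition`, standing in for p_t(l'|l), is hypothetical.

```python
# Factored transition structure of the meal-planning MDP (sketch):
# the next meal quantity q' equals the action a; only reservations l are random.

def transition_prob(t, l_next, q_next, l, q, a, res_transition):
    """p_t((l', q') | (l, q), a): zero unless the action sets q' = a."""
    if q_next != a:
        return 0.0
    return res_transition(t, l_next, l)  # hypothetical reservation kernel p_t(l'|l)
```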
An Optimal Decision Rule
[Figure: optimal meal quantity at decision epoch 1, shown on the timeline: 18 hours (schedule, production), 6 hours (order assembly), 3 hours (order ready to go), then adjust with van before departure.]
Empirical Performance
[Figure: histograms of provisioning error, actual vs. model (out of sample).]
– Actual: mean = 9.81, standard deviation = 8.46
– Model (out of sample): mean = 7.99, standard deviation = 6.96
Overage versus Shortage
• Evaluate the model over a range of terminal
costs
• Observe the relationship of average overage
and proportion of flights short-catered
• 55 flight number / aircraft capacity
combinations (evaluated separately)

Overage versus Shortage
Performance of optimal policies:

| Flight group | Optimal outperforms actual | Optimal close to actual | Actual outperforms optimal | Total |
|---|---|---|---|---|
| INTERNATIONAL | 6 | 1 | 0 | 7 |
| TRANSBORDER | 13 | 3 | 0 | 16 |
| TRANSCON | 9 | 0 | 2 | 11 |
| WEST | 11 | 3 | 7 | 21 |
| Total | 39 | 7 | 9 | 55 |
Information Acquisition

Information Acquisition and Optimization
• Objective: Investigate the tradeoff between acquiring
information and optimal policy choice
• Examples:
– Harpaz, Lee and Winkler (1982) study output decisions of a
competitive firm in a market with random demand in which the
demand distribution is unknown.
– Braden and Oren (1994) study dynamic pricing decisions of a
firm in a market with unknown consumer demand curves.
– Lariviere and Porteus (1999) and Ding, Puterman and Bisi
(2002) study order decisions of a censored newsvendor with
unobservable lost sales and unknown demand distributions.
• Key result - it is optimal to “experiment”

Bayesian Newsvendor Model
• Newsvendor cost structure (c - cost, h - salvage value,
p - penalty cost)
• Demand assumptions
– positive continuous
– i.i.d. sample from f(x|θ) with unknown θ
– prior on θ is π₁(θ)
• Assume first that demand is fully observable:
Demand = Sales + "observed" lost sales

Time Line of Events

[Timeline: in period 1, set y₁ and then observe x₁, incurring r(x₁, y₁); in period 2, set y₂ and then observe x₂, incurring r(x₂, y₂); the prior evolves π₁(θ) → π₂(θ) → π₃(θ).]
Demand Updating
Observing demand x_n updates π_n(θ) to π_{n+1}(θ):

$$\pi_{n+1}(\theta) = \Pr(\theta \mid x_n, \pi_n) = \frac{f(x_n \mid \theta)\, \pi_n(\theta)}{\int f(x_n \mid \theta)\, \pi_n(\theta)\, d\theta}$$

The predictive demand density is

$$m_n(x) = \int f(x \mid \theta)\, \pi_n(\theta)\, d\theta$$

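The talk does not fix a demand family, but for one conjugate choice the update has a closed form. The sketch below assumes exponential demand f(x|θ) = θe^{−θx} with a Gamma(α, β) prior on θ, so the posterior stays in the gamma family and the predictive m_n is a Lomax density.

```python
# Conjugate Bayesian demand updating (illustrative assumption):
# exponential demand with a Gamma(alpha, beta) prior on the rate theta.

def update(alpha, beta, x):
    """Posterior after exactly observing demand x: Gamma(alpha + 1, beta + x)."""
    return alpha + 1.0, beta + x

def predictive_density(alpha, beta, x):
    """m_n(x) = integral of f(x|theta) * pi_n(theta) dtheta  (a Lomax density)."""
    return alpha * beta**alpha / (beta + x) ** (alpha + 1)
```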
Bayesian Newsvendor Model
Bayesian MDP Formulation
At decision epoch n (n = 1, 2, …, N):
– States: S_n = {all probability distributions on the unknown parameter}
– Actions: A_s = R₊ = [0, ∞) for all s
– Costs:

$$r(\pi_n, y_n) = E_{\pi_n}[r(X, y_n)] = c\, y_n - h \int_0^{y_n} (y_n - x)\, m_n(x)\, dx + p \int_{y_n}^{\infty} (x - y_n)\, m_n(x)\, dx$$

– Transition probabilities: $p(\pi_{n+1} \mid \pi_n, y_n, x_n) = m_n(x_n)$
Bayesian Newsvendor Model
The Optimality Equations
$$u_n(\pi_n) = \min_{y_n \in R_+} \Big\{ r(\pi_n, y_n) + \int_{R_+} u_{n+1}(\pi_{n+1})\, m_n(x)\, dx \Big\}$$

for n = 1, …, N, with the boundary condition $u_{N+1}(\pi_{N+1}) = r(\pi_{N+1}, y_{N+1}) = 0$.
Key Observation
The transition probabilities are independent of the
actions. So the problem can be reduced to a sequence
of single-period problems.

The Bayesian Newsvendor Policy
The BMDP reduces to a sequence of single-period,
two-step problems.
• Demand distribution parameter updating: $\pi_{n-1} \xrightarrow{x_{n-1}} \pi_n$
• Cost minimization:

$$y_n^{BN} = M_n^{-1}\Big(\frac{p - c}{p - h}\Big) \quad \text{for } n = 1, \dots, N,$$

where M_n is the CDF of m_n.

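As a numerical illustration under the conjugate assumption above, the Lomax predictive has CDF M_n(y) = 1 − (β/(β+y))^α, which inverts in closed form; the cost parameters below are made up.

```python
# Bayesian newsvendor order quantity y_n^BN = M_n^{-1}((p - c) / (p - h))
# for the illustrative Lomax predictive with CDF 1 - (beta / (beta + y))**alpha.

def bayes_newsvendor_order(alpha, beta, c, h, p):
    fractile = (p - c) / (p - h)              # critical ratio in (0, 1)
    return beta * ((1.0 - fractile) ** (-1.0 / alpha) - 1.0)

# illustrative numbers: unit cost 5, salvage value 1, shortage penalty 12
print(bayes_newsvendor_order(alpha=3.0, beta=40.0, c=5.0, h=1.0, p=12.0))
```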
Bayesian Newsvendor with Unobservable
Lost Sales
• Model Set-up
– Same as fully observable case but unmet
demand is lost and unobservable

• Question
– Is the Bayesian Newsvendor policy optimal?

Bayesian Newsvendor with Unobservable Lost
Sales
Demand is censored by the order quantity:
– X_n = x_n < y_n: demand exactly observed
– X_n = x_n = y_n: demand censored

Demand updating is different in this case:

$$\pi_2(\theta) = \frac{f(x_1 \mid \theta)\, \pi_1(\theta)}{\int f(x_1 \mid \theta)\, \pi_1(\theta)\, d\theta} \quad \text{if } x_1 < y_1$$

$$\pi_2^c(\theta) = \frac{\int_{y_1}^{\infty} f(x_1 \mid \theta)\, dx_1\; \pi_1(\theta)}{\int \int_{y_1}^{\infty} f(x_1 \mid \theta)\, \pi_1(\theta)\, dx_1\, d\theta} \quad \text{if } x_1 = y_1$$
Bayesian Newsvendor with Unobservable Lost Sales
After setting y_{n−1}, the posterior branches on the observation:

$$\pi_n(\theta) = \frac{f(0 \mid \theta)\, \pi_{n-1}(\theta)}{\int f(0 \mid \theta)\, \pi_{n-1}(\theta)\, d\theta} \qquad \text{(demand 0 observed)}$$

$$\pi_n(\theta) = \frac{f(1 \mid \theta)\, \pi_{n-1}(\theta)}{\int f(1 \mid \theta)\, \pi_{n-1}(\theta)\, d\theta} \qquad \text{(demand 1 observed)}$$

$$\pi_n^c(\theta) = \frac{\int_{y_{n-1}}^{\infty} f(x \mid \theta)\, dx\; \pi_{n-1}(\theta)}{\int \int_{y_{n-1}}^{\infty} f(x \mid \theta)\, \pi_{n-1}(\theta)\, dx\, d\theta} \qquad \text{(demand censored at } y_{n-1})$$
Bayesian Newsvendor with Unobservable Lost Sales
Model Formulation
– States, Actions, Costs: as above
– Transition probabilities:

$$p(\pi_{n+1} \mid \pi_n, y_n) = \begin{cases} m_n(x) & \text{when } \pi_{n+1} = \pi_n \text{ updated with an exact observation } x < y_n \\ 1 - M_n(y_n) & \text{when } \pi_{n+1} = \pi_n^c \end{cases}$$

The Optimality Equations

$$u_N(\pi_N) = \min_{y_N \in R_+} \{ r(\pi_N, y_N) \}$$

$$u_n(\pi_n) = \min_{y_n \in R_+} \Big\{ r(\pi_n, y_n) + \int_0^{y_n} u_{n+1}(\pi_{n+1})\, m_n(x)\, dx + u_{n+1}(\pi_{n+1}^c)\,[1 - M_n(y_n)] \Big\}$$
Bayesian Newsvendor with Unobservable Lost Sales

The Key Result

$$y_n^* \ge y_n^{BN} = M_n^{-1}\Big(\frac{p - c}{p - h}\Big) \ \text{for } n = 1, \dots, N - 1; \qquad y_N^* = y_N^{BN} = M_N^{-1}\Big(\frac{p - c}{p - h}\Big)$$

if f(x|θ) is increasing in θ in the likelihood-ratio order.
• In this model, decisions in separate periods are interrelated through the optimality equation.
• This means that it is optimal to trade off short-term optimality for learning.
• Question: What is an upper bound on y_n*?

Bayesian Newsvendor with Unobservable Lost Sales

Solving the optimality equations gives, for N:

$$(c - h)\, M_N(y_N) = (p - c)\,[1 - M_N(y_N)]$$

and for n = 1, …, N − 1:

$$(c - h)\, M_n(y_n) = \Big[ p - \frac{d u_{n+1}^c(y_n)}{d y_n} + \big(u_{n+1}^c(y_n) - u_{n+1}(y_n)\big) \frac{m_n(y_n)}{1 - M_n(y_n)} - c \Big] [1 - M_n(y_n)] = \big(p'(y_n) - c\big)\,[1 - M_n(y_n)]$$

where p′(y_n) is a "policy dependent penalty cost".
The proof of the key result is based on showing p′(y_n) > p for n < N.

Bayesian Newsvendor with Unobservable Lost Sales

– The extra penalty can be interpreted as the
marginal expected value of information at
decision epoch n.
– Numerical results show small improvements
when using the optimal policy as opposed to the
Bayesian Newsvendor policy.
– We have extended this to a two-level supply chain.

Partially Observed MDPs
• In POMDPs, system state is not observable.
– Decision maker receives a signal y which is related to
the system state by q(y|s,a).
• Analysis is based on using Bayes' Theorem to estimate the distribution of the system state given the signal
– Similar to Bayesian MDPs described above
• the posterior state distribution is a sufficient statistic for
decision making
– State space is a continuum
– Early work by Smallwood and Sondik (1972)
• Applications
– Medical diagnosis and treatment
– Equipment repair

Reinforcement Learning and
Neuro-Dynamic Programming

Neuro-Dynamic Programming or
Reinforcement Learning
• A different way to think about dynamic
programming
– Basis in artificial intelligence research
– Mimics learning behavior of animals
• Developed to:
– Solve problems with high dimensional state spaces
and/or
– Solve problems in which the system is described by a
simulator (as opposed to a mathematical model)
• NDP/RL refers to a collection of methods combining Monte Carlo methods with MDP concepts

Reinforcement Learning
• Mimics learning by experimenting in real life
– learn by interacting with the environment
– goal is long term
– uncertainty may be present (a task must be repeated many times to obtain its value)
• Trades off between exploration and exploitation
• Key focus is estimating value function ( v(s) or
Q(s,a) )
– Carry out task and observe immediate outcome (reward
and transition)
– Update value function

Reinforcement Learning - Example
• Playing Tic-Tac-Toe (Sutton and Barto, 2000)
– You know the rules of the game but not the opponent's strategy (assumed fixed over time but with a random component)
– Approach
• List possible system states
• Start with initial guess of probability of winning in each state
• Observe current state (s) and choose action that will move you
to state with highest probability of winning
• Observe state (s’) after opponent plays
• Revise value in state s based on value in s’.
– The player might not always choose the "best" action, but may try other actions to learn about different states.

Reinforcement Learning - Example
– The state space has 3^9 = 19,683 configurations
– Writing down a mathematical model for the game is challenging; simulating it is easy
– The goal is to maximize the probability of winning; there is no immediate reward
– A possible updating mechanism using observations:

$$v_{new}(s) = v_{old}(s) + \alpha\,\big(v_{old}(s') - v_{old}(s)\big)$$

• α is a step-size parameter
• $v_{old}(s') - v_{old}(s)$ is a temporal difference
– The subsequent state s′ depends on the player's action.

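A minimal sketch of this temporal-difference update in code; the 0.5 default (an initial guess of the win probability) and the step size are illustrative choices.

```python
# Tabular TD(0)-style update for the tic-tac-toe value estimates.
from collections import defaultdict

alpha = 0.1                           # step-size parameter (illustrative)
v = defaultdict(lambda: 0.5)          # v(s): estimated win probability

def td_update(s, s_next):
    """v_new(s) = v_old(s) + alpha * (v_old(s') - v_old(s))."""
    v[s] += alpha * (v[s_next] - v[s])
```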
Reinforcement Learning
Problems can be classified in two ways:

| | Model | No Model |
|---|---|---|
| Terminating | Estimate v(s) by simulating the process many times to termination; use the estimate with the model to choose a policy or improvement. | Estimate Q(s,a) by simulating the process many times to termination; use the estimate of Q(s,a) to choose a policy or improvement. |
| Non-terminating | Estimate v(s) by temporal-difference updating; use the estimate with the model to choose a policy or improvement. | Estimate Q(s,a) by temporal-difference updating; use the estimate of Q(s,a) to choose a policy or improvement. |
Temporal Difference Updating
• No-model example, discounted case, based on Q(s, a):

$$Q(s, a) = E_{s,a}\Big\{ \sum_{t=0}^{\infty} \lambda^t\, r(X_t, Y_t) \Big\}$$

• Algorithm (policy specified)
– Start the system in s, choose action a, and observe s′ and a′
– Update: $Q(s,a) \leftarrow Q(s,a) + \alpha\,( r + \lambda Q(s',a') - Q(s,a))$
– Repeat, replacing (s, a) by (s′, a′)
• Algorithm (no policy specified) (Q-Learning)
– Start the system in s, choose action a, and observe s′
– Update: $Q(s,a) \leftarrow Q(s,a) + \alpha\,( r + \lambda \max_{a' \in A_{s'}} Q(s',a') - Q(s,a))$
– Repeat, replacing s by s′
• Issues include choosing α and stopping criteria

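A hedged sketch of tabular Q-learning built around the second update above; the environment's `env_step` function and the epsilon-greedy exploration rule are assumptions of the sketch, not part of the talk.

```python
# Tabular Q-learning with epsilon-greedy exploration (sketch).
import random
from collections import defaultdict

def q_learning(env_step, actions, s0, alpha=0.1, lam=0.9, eps=0.1, n_steps=10_000):
    Q = defaultdict(float)                       # Q(s, a), initialized to 0
    s = s0
    for _ in range(n_steps):
        if random.random() < eps:                # explore ...
            a = random.choice(actions[s])
        else:                                    # ... or exploit
            a = max(actions[s], key=lambda b: Q[s, b])
        r, s_next = env_step(s, a)               # observe reward and next state
        target = r + lam * max(Q[s_next, b] for b in actions[s_next])
        Q[s, a] += alpha * (target - Q[s, a])    # temporal-difference update
        s = s_next
    return Q
```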
RL Function Approximation

– Replace v(s) or Q(s,a) by the representation

$$\tilde Q(s, a) = \sum_{i=1}^{k} w_i\, \phi_i(s, a)$$

and then apply the Q-learning algorithm, updating the weights w_i at each iteration, or
– approximate v(s) or Q(s,a) by a neural network.
• Issue: choose "basis functions" φ_i(s, a) to reflect problem structure

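A brief sketch of the linear representation above combined with a Q-learning-style weight update; the basis functions `phi` are assumed given, and the gradient of the approximate Q with respect to w is just φ(s, a).

```python
# Q-learning with a linear approximator Q~(s, a) = sum_i w_i * phi_i(s, a).
import numpy as np

def q_approx(w, phi, s, a):
    return float(np.dot(w, phi(s, a)))           # phi returns a length-k vector

def linear_q_update(w, phi, s, a, r, s_next, actions_next, alpha=0.05, lam=0.9):
    """One weight update; the TD error multiplies the gradient phi(s, a)."""
    target = r + lam * max(q_approx(w, phi, s_next, b) for b in actions_next)
    td_error = target - q_approx(w, phi, s, a)
    return w + alpha * td_error * phi(s, a)
```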
RL Applications
•   Backgammon
•   Checker Player
•   Robot Control
•   Elevator Dispatching
•   Dynamic Telecommunications Channel Allocation
•   Job Shop Scheduling
•   Supply Chain Management

Neuro-Dynamic Programming
Reinforcement Learning

“It is unclear which algorithms and parameter settings will work on a particular problem, and when a method does work, it is still unclear which ingredients are actually necessary for success. As a result, applications often require trial and error in a long process of parameter tweaking and experimentation.”
Van Roy (2002)

Conclusions

• MDPs provide an elegant formal framework for
sequential decision making
• They are widely applicable
– They can be used to compute optimal policies
– They can be used as a baseline to evaluate heuristics
– They can be used to determine structural results about
optimal policies
• Recent research is addressing “The Curse of
Dimensionality”

Some References
Bertsekas, D.P. and Tsitsiklis, J.N., Neuro-Dynamic Programming, Athena Scientific, 1996.
Feinberg, E.A. and Shwartz, A. (eds.), Handbook of Markov Decision Processes: Methods and Applications, Kluwer, 2002.
Puterman, M.L., Markov Decision Processes, Wiley, 1994.
Sutton, R.S. and Barto, A.G., Reinforcement Learning, MIT Press, 2000.
