Solving Factored MDPs with Continuous and Discrete Variables
Carlos Guestrin (Intel Research, Berkeley)
Milos Hauskrecht (Department of Computer Science)
Branislav Kveton (Intelligent Systems Program)
Introduction | Approximate LP for HMDPs | Factored ε-HALP Algorithm | Experimental Results
Hybrid Markov Decision Processes
Many real-world stochastic planning problems have continuous and discrete variables and are naturally formulated as hybrid MDPs (HMDPs).
There are few methods for solving hybrid MDPs.

Hybrid MDPs are Complex to Solve
Traditional solution techniques are affected by the curse of dimensionality.
  Discrete-state MDPs: state and action spaces grow exponentially with the number of variables.
  Continuous-state MDPs: state and action spaces are infinitely large.
Often, no closed-form representation for the value function exists.
Naïve discretization often leads to exponential complexity.

Linear Value Function
The value function is represented as a linear combination of k basis functions:
  $V(\mathbf{x}) = \sum_{i=1}^{k} w_i f_i(\mathbf{x})$
Basis functions $f_i(\mathbf{x})$ depend on continuous and discrete variables. Optimization is performed over the weights $\mathbf{w}$.

HALP Formulation
Hybrid approximate LP (HALP) formulation:
  $\min_{\mathbf{w}} \sum_i w_i \alpha_i$
  subject to: $\sum_i w_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) \ge 0 \quad \forall\, \mathbf{x} \in \mathbf{X},\ \mathbf{a} \in \mathbf{A}$
where
  $\alpha_i$ is the state relevance weight:
    $\alpha_i = \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} \psi(\mathbf{x}) f_i(\mathbf{x}) \, d\mathbf{x}_C$
  $F_i(\mathbf{x}, \mathbf{a})$ is the difference between the basis function $f_i(\mathbf{x})$ and its discounted backprojection:
    $F_i(\mathbf{x}, \mathbf{a}) = f_i(\mathbf{x}) - \gamma \sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) f_i(\mathbf{x}') \, d\mathbf{x}'_C$

Factored ε-HALP Formulation
The HALP formulation contains an infinite number of constraints, one for each state x and action a.
Continuous state and action variables are discretized to (1/ε + 1) equally spaced values.
The total number of points per factor is exponential only in the dimension of the factor.
The number of constraints is finite, although exponential in the number of variables.

Efficient Solution for Factored ε-HALP
1. Discretize continuous state and action variables.
2. Identify the subsets of variables X_i and A_i (X_j and A_j) that the functions F_i(x, a) (R_j(x, a)) depend on.
3. Compute F_i(x_i, a_i) (and R_j(x_j, a_j)) for all possible configurations of X_i and A_i (X_j and A_j).
4. Calculate the state relevance weights α_i.
5. Use the ALP algorithm for factored discrete-valued variables to find the vector of optimal weights ŵ (Guestrin et al. 2001).

Irrigation Network Example
An irrigation network is a network of irrigation channels connected by regulation devices.
[Figure: large irrigation network in the n-ring and n-ring-of-rings topologies, with inflow and outflow regulation devices. A regulation device is represented by a discrete action node; an irrigation channel is represented by a continuous variable.]
Transition functions represent water flows between channels given actions at regulation devices.
The objective is to operate the valves so as to maintain optimal water levels.
The reward function characterizes preferred water levels.
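The ε-grid discretization described under Factored ε-HALP Formulation can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 6-variable ring, and the pairwise factor scopes are assumptions made up for illustration. It only shows why per-factor tables stay small while the joint grid grows exponentially.

```python
from fractions import Fraction
from itertools import product


def epsilon_grid(eps):
    """Return the (1/eps + 1) equally spaced grid values on [0, 1]."""
    steps = Fraction(1) / Fraction(eps)
    if steps.denominator != 1:
        raise ValueError("this sketch assumes 1/eps is an integer")
    n = int(steps)
    return [Fraction(k, n) for k in range(n + 1)]


def factor_configurations(scope, eps):
    """Enumerate every grid configuration of the variables in one factor scope."""
    grid = epsilon_grid(eps)
    return [dict(zip(scope, values)) for values in product(grid, repeat=len(scope))]


def grid_sizes(num_vars, scopes, eps):
    """Return (joint grid size, total number of per-factor table entries)."""
    points = len(epsilon_grid(eps))
    joint = points ** num_vars
    per_factor = sum(points ** len(scope) for scope in scopes)
    return joint, per_factor


if __name__ == "__main__":
    # Hypothetical ring of 6 continuous variables; each factor (an assumption
    # of this sketch) depends on two neighbouring variables only.
    scopes = [(f"x{i}", f"x{(i + 1) % 6}") for i in range(6)]
    print(epsilon_grid(Fraction(1, 4)))
    print(grid_sizes(6, scopes, Fraction(1, 4)))
```

With ε = 1/4 each variable takes 5 values, so the joint grid has 5^6 = 15625 points, whereas the six pairwise factors need only 6 × 25 = 150 table entries in total.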
Factored Hybrid MDPs
A multiagent factored hybrid MDP (HMDP) is a 4-tuple (X, A, P, R):
  X is a vector of state variables (discrete or continuous)
  A is a vector of action variables (discrete or continuous)
  Continuous variables are restricted to [0, 1]
  P is a transition model represented by a DBN
  R is a reward function that is a sum of local rewards
[Figure: DBN with action nodes A1, A2, A3, state variables X1, X2, X3, next-state variables X'1, X'2, X'3, and local rewards R1, R2.]

Representation of Conditional Probabilities
Parametric representation of the transition model.
Discrete child with discrete parents: tabular, decision trees, noisy-or, etc.

Quality of HALP Approximation
Proposition 1. Let $\tilde{\mathbf{w}}$ be an optimal solution of the HALP. Then, for any Lyapunov function L(x):
  $\| V^* - H\tilde{\mathbf{w}} \|_{1,\psi} \le \frac{2\, \psi^{\mathsf{T}} L}{1 - \kappa} \min_{\mathbf{w}} \| V^* - H\mathbf{w} \|_{\infty, 1/L}$
This is analogous to the de Farias and Van Roy (2001) result for approximate LP for discrete MDPs.

Representational and Computational Challenges
Constraints require representation of backprojections, which are functions of continuous and discrete variables.
HALP requires the solution of a (linear) convex problem with an infinite number of constraints.

Near Feasibility Implies Near Optimality
The solution of the ε-HALP likely violates constraints in the HALP.
Proposition 2. Let $\tilde{\mathbf{w}}$ be an optimal solution of the HALP and $\hat{\mathbf{w}}$ be an optimal solution of the ε-HALP, such that the solution $\hat{\mathbf{w}}$ is δ-infeasible:
  $\sum_i \hat{w}_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) \ge -\delta \quad \forall\, \mathbf{x} \in \mathbf{X},\ \mathbf{a} \in \mathbf{A}$
Then:
  $\| V^* - H\hat{\mathbf{w}} \|_{1,\psi} \le \| V^* - H\tilde{\mathbf{w}} \|_{1,\psi} + \frac{2\delta}{1 - \gamma}$

Quality of ε-HALP Approximation
Theorem 1. Let $\hat{\mathbf{w}}$ be an optimal solution of the ε-HALP satisfying the δ-infeasibility condition. Then, for any Lyapunov function L(x):
  $\| V^* - H\hat{\mathbf{w}} \|_{1,\psi} \le \frac{2\delta}{1 - \gamma} + \frac{2\, \psi^{\mathsf{T}} L}{1 - \kappa} \min_{\mathbf{w}} \| V^* - H\mathbf{w} \|_{\infty, 1/L}$

Experimental Results
The continuous formulation of the irrigation network problem cannot be solved exactly by any MDP solver.
Evaluation of solution quality (mean and standard deviation) and running time (in seconds):

  ε-HALP
  ε       Mean    Std    Time
  1       42.8    3.0        2
  1/2     60.3    3.0       21
  1/4     61.9    2.9      184
  1/8     72.2    3.5     1068
  1/16    73.8    3.0    13219

  Alternative solutions
  Method       Mean    Std
  Random       35.9    2.7
  Local        55.4    2.5
  Global 1     60.4    3.0
  Global 4     66.0    3.6
  Global 16    68.2    3.2

The quality of the ε-HALP solution beats alternative approximate optimization techniques on the large irrigation network example.

  n-ring
          n = 6          n = 9          n = 12         n = 15         n = 18
  ε       Mean   Time    Mean   Time    Mean   Time    Mean   Time    Mean   Time
  1       28.4      1    37.5      1    46.9      1    55.6      2    64.5      3
  1/2     33.5      3    43.0      5    52.6      9    62.9     17    72.1     28
  1/4     35.1     11    45.2     21    54.2     43    64.2     63    74.5     85
  1/8     40.1     46    51.4     85    62.2    118    73.2    168    84.9    193
  1/16    40.4    331    51.8    519    63.7    709    75.5    963    86.8   1285

  n-ring-of-rings
          n = 6          n = 9          n = 12         n = 15         n = 18
  ε       Mean   Time    Mean   Time    Mean   Time    Mean   Time    Mean   Time
  1       14.8      1    16.2      2    17.5      4    18.5      5    19.7      6
  1/2     38.6     12    50.5     25    44.0    103    75.8     69    87.6    107
  1/4     40.1     82    53.6    184    66.7    345    79.0    590    93.1    861
  1/8     48.0    581    62.4   1250    76.1   2367    90.5   3977   104.5   6377
  1/16    47.1   4736    62.3  11369    77.6  22699    92.4  35281   107.8  53600
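The δ-infeasibility notion in Proposition 2 can be made concrete with a deliberately tiny sketch. Everything in it is an assumption chosen for illustration rather than the poster's experiment: one continuous state and one continuous action in [0, 1], a single constant basis function f(x) = 1, a hypothetical quadratic reward, and a hypothetical discount factor. With a constant basis, F(x, a) = 1 − γ for any transition model, so the grid-restricted LP reduces to a one-line maximization.

```python
from fractions import Fraction
from itertools import product

GAMMA = Fraction(19, 20)  # hypothetical discount factor


def reward(x, a):
    """Hypothetical smooth reward, peaked at a point that coarse grids miss."""
    return 1 - (x - Fraction(3, 10)) ** 2 - (a - Fraction(3, 5)) ** 2


def grid(eps):
    """The (1/eps + 1) equally spaced values on [0, 1] (1/eps must be an integer)."""
    n = int(1 / Fraction(eps))
    return [Fraction(k, n) for k in range(n + 1)]


def solve_eps_halp(eps):
    """Smallest w with w * (1 - GAMMA) - R(x, a) >= 0 at every grid point.

    The objective w * alpha (alpha = 1 for the constant basis) is minimized
    by the smallest feasible w, so no LP solver is needed in this toy case.
    """
    worst = max(reward(x, a) for x, a in product(grid(eps), repeat=2))
    return worst / (1 - GAMMA)


def max_violation(w, check_eps=Fraction(1, 100)):
    """Estimate delta: the largest violation of the original constraints,
    measured on a much finer grid as a stand-in for the continuous space."""
    slack = min(w * (1 - GAMMA) - reward(x, a)
                for x, a in product(grid(check_eps), repeat=2))
    return max(Fraction(0), -slack)


if __name__ == "__main__":
    for eps in (Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
        w_hat = solve_eps_halp(eps)
        print(eps, w_hat, max_violation(w_hat))
```

In this toy setting the measured violation drops from 1/20 at ε = 1/2 to 1/80 at ε = 1/4, which is the kind of δ that enters the 2δ/(1 − γ) term of the bound.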
Representation of Conditional Probabilities (continued)
Discrete child with continuous and discrete parents:
  $P(X_i' = j \mid \mathrm{Par}(X_i')) = \frac{d_j(\mathrm{Par}(X_i'))}{\sum_u d_u(\mathrm{Par}(X_i'))}$
  where $d_j$ is a discriminant function and the denominator is a normalizing factor.
Continuous child with continuous and discrete parents:
  $P(X_i' \mid \mathrm{Par}(X_i')) = \mathrm{Beta}\big(X_i' \mid h_i^1(\mathrm{Par}(X_i')),\ h_i^2(\mathrm{Par}(X_i'))\big)$
  Mixture of beta distributions.
[Figure: two example distributions, each labeled "Moment > 0".]

Optimal Policy and Value Function
The value function of an optimal policy satisfies the Bellman-Hamilton-Jacobi fixed point equation:
  $V^*(\mathbf{x}) = \sup_{\mathbf{a}} \Big[ R(\mathbf{x}, \mathbf{a}) + \gamma \sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) V^*(\mathbf{x}') \, d\mathbf{x}'_C \Big]$
The value function V(x) is difficult to compute and represent.
  A closed-form solution of the value function may not exist due to the recursive integral definition.
  Approximate solutions.

Choice of Representation
Continuous basis functions are defined as polynomials:
  $f_i(\mathbf{x}_i) = \prod_{x_j \in \mathbf{x}_i} x_j^{m_{j,i}}$
Basis function decomposition along continuous and discrete factors:
  $f_i(\mathbf{x}_i) = f_{iD}(\mathbf{x}_{iD}) \, f_{iC}(\mathbf{x}_{iC})$
Closed-form representation of the objective function.
Mixture of betas transition model for continuous factors.
Decomposition of the constraints along continuous and discrete functions and closed-form representation:
  $\sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) f_i(\mathbf{x}') \, d\mathbf{x}'_C = \Big[ \sum_{\mathbf{x}'_{iD}} p(\mathbf{x}'_{iD} \mid \mathbf{x}, \mathbf{a}) f_{iD}(\mathbf{x}'_{iD}) \Big] \Big[ \int_{\mathbf{x}'_{iC}} p(\mathbf{x}'_{iC} \mid \mathbf{x}, \mathbf{a}) f_{iC}(\mathbf{x}'_{iC}) \, d\mathbf{x}'_{iC} \Big]$

Achieving δ-Infeasibility
Appropriate choice of ε-grid to achieve δ-infeasibility:
  $\Big| \sum_i \hat{w}_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) - \Big[ \sum_i \hat{w}_i F_i(\mathbf{x}_G, \mathbf{a}_G) - R(\mathbf{x}_G, \mathbf{a}_G) \Big] \Big| \le \delta$
(x_G, a_G) is the closest ε-grid point to the state-action pair (x, a).
Lipschitz modulus of the discretized functions:
  $\delta \le \varepsilon M K_{\max}$
  where M is the number of factors and $K_{\max}$ is the worst-case Lipschitz constant over the functions $\hat{w}_i F_i(\mathbf{x}, \mathbf{a})$ and $R_j(\mathbf{x}, \mathbf{a})$.

Summary of Factored ε-HALP Algorithm
Discretize continuous variables using a regular ε-spaced grid.
Formulate a linear program with constraints restricted only to grid points.
Solve the LP using an ALP algorithm for factored discrete MDPs.

Experimental Results (continued)
Solution quality improves with higher grid resolution.
Time complexity grows polynomially with higher grid resolution.
Time complexity grows polynomially with network topology size n.

Conclusions
HALP provides an effective formulation for solving hybrid MDPs.
  Including bounds on the quality of the solution.
Factored hybrid MDPs allow for closed-form representation of HALP constraints.
  The number of constraints remains infinite.
Exploit factorization for efficient discretization, ε-HALP.
  Provide bounds on the effect of discretization.
  Lipschitz constant grows linearly in the number of variables.
Using factored LP decomposition to solve ε-HALP.
  For fixed tree-width, running time is polynomial in the number of variables and the discretization level 1/ε.
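The closed-form backprojection mentioned under Choice of Representation can be sketched for the simplest case: one continuous variable, a monomial basis function f(x) = x^m, and a beta (or mixture-of-betas) transition density. The sketch relies on the standard raw-moment identity for the beta distribution; the function names and the numeric values are assumptions for illustration, and h1, h2 stand in for the outputs of the parameter functions h_i^1, h_i^2 at a given parent configuration.

```python
import math


def beta_raw_moment(h1, h2, m):
    """E[X^m] for X ~ Beta(h1, h2): prod_{r=0}^{m-1} (h1 + r) / (h1 + h2 + r)."""
    moment = 1.0
    for r in range(m):
        moment *= (h1 + r) / (h1 + h2 + r)
    return moment


def mixture_raw_moment(components, m):
    """Raw moment of a mixture of betas; components = [(weight, h1, h2), ...]."""
    return sum(weight * beta_raw_moment(h1, h2, m) for weight, h1, h2 in components)


def F_monomial(x, h1, h2, m, gamma):
    """F(x, a) = f(x) - gamma * backprojection of f, for f(x) = x^m.

    The backprojection integral of Beta(x' | h1, h2) * x'^m over [0, 1]
    is exactly the m-th raw moment, so no numerical integration is needed.
    """
    return x ** m - gamma * beta_raw_moment(h1, h2, m)


def numeric_backprojection(h1, h2, m, steps=20000):
    """Midpoint-rule integral, used only to sanity-check the closed form."""
    norm = math.gamma(h1 + h2) / (math.gamma(h1) * math.gamma(h2))
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * width
        total += norm * x ** (h1 - 1) * (1 - x) ** (h2 - 1) * x ** m * width
    return total


if __name__ == "__main__":
    print(beta_raw_moment(2, 3, 2), numeric_backprojection(2, 3, 2))
    print(F_monomial(0.5, 2, 3, 1, 0.95))
```

Because the moments of a mixture are the weighted sums of the component moments, the same identity covers the mixture-of-betas transition model, which is what lets polynomial basis functions be backprojected without numerical integration.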