Dynamic Programming
Zijing Hu
November 11, 2022
Contents
1 Finite Horizon Dynamic Programming
2 Infinite Horizon Dynamic Programming
  2.1 Dynamic Programming with Discounted Cost
  2.2 Dynamic Programming with Average Cost

*This note is based on ECEN 755: Stochastic Systems by Dr. P. R. Kumar, TAMU.
1 Finite Horizon Dynamic Programming
Dynamic Programming
• It solves the problem for all initial states (computationally expensive).
• It uses backward recursion.
• It relies on the principle of optimality: segments of an optimal path are themselves optimal.
• The optimal solution is given in feedback/closed-loop form, i.e., as a map from states to actions.
Notation
• $x(t) \in \mathcal{X}$: state at time $t$, where $\mathcal{X}$ is the state space
• $u(t) \in \mathcal{U}$: action taken at time $t$, where $\mathcal{U}$ is the action/control set
• $T$: finite time horizon
• $c(x, u, t)$: cost of taking action $u$ in state $x$ at time $t$
• $d(x)$: cost of being in state $x$ at the end of the horizon (terminal cost)
• $f(x, u, t)$: the state you will be in at time $t+1$ if you take action $u$ in state $x$ at time $t$
Discrete Time Dynamic System. The target problem is
\[
\min_{u} \left[ d\big(x(T)\big) + \sum_{t=0}^{T-1} c\big(x(t), u(t), t\big) \right].
\]
Define $V(x, t)$ as the minimum cost-to-go from state $x$ at time $t$ until the end of the horizon. Then we have
\begin{align*}
V(x, T) &= d(x), \\
V(x, t-1) &= \min_{u} \big[ c(x, u, t-1) + V\big(f(x, u, t-1), t\big) \big].
\end{align*}
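To make the backward recursion concrete, here is a small self-contained Python sketch on a toy deterministic problem. The specific cost $c$, terminal cost $d$, and dynamics $f$ below are made-up placeholders, not taken from the lecture.
\begin{verbatim}
# Backward recursion for a finite-horizon deterministic problem:
#   V(x, T) = d(x),   V(x, t) = min_u [ c(x, u, t) + V(f(x, u, t), t+1) ].
# The cost, terminal cost, and dynamics below are illustrative placeholders.
T = 5
states = range(4)
actions = range(2)

def c(x, u, t):                     # stage cost (placeholder)
    return (x - 2) ** 2 + u

def d(x):                           # terminal cost (placeholder)
    return abs(x - 2)

def f(x, u, t):                     # deterministic dynamics (placeholder)
    return min(max(x + (1 if u == 1 else -1), 0), 3)

V = {(x, T): d(x) for x in states}  # boundary condition V(x, T) = d(x)
policy = {}
for t in range(T - 1, -1, -1):      # backward in time
    for x in states:
        best_u, best_val = min(
            ((u, c(x, u, t) + V[(f(x, u, t), t + 1)]) for u in actions),
            key=lambda p: p[1],
        )
        V[(x, t)] = best_val
        policy[(x, t)] = best_u     # feedback (closed-loop) policy
\end{verbatim}
The minimizing action is stored for every $(x, t)$, so the result is exactly the feedback form described above.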
Continuous Time Dynamic System. Given the dynamics $\dot{x}(t) = f(x(t), u(t), t)$, the target problem becomes
\[
\min_{u} \left[ d\big(x(T)\big) + \int_{0}^{T} c\big(x(t), u(t), t\big)\, dt \right].
\]
Define $V(x, t)$ similarly as above. For a small step size $h > 0$, we have
\[
V(x, t) = \min_{u} \big[ c(x, u, t)h + V\big(x + f(x, u, t)h,\; t + h\big) \big]
\]
\[
V(x, t) \approx \min_{u} \left[ c(x, u, t)h + V(x, t) + \frac{\partial V(x, t)}{\partial x} f(x, u, t)h + \frac{\partial V(x, t)}{\partial t} h \right]
\]
\[
-\frac{\partial V(x, t)}{\partial t} = \min_{u} \left[ c(x, u, t) + \frac{\partial V(x, t)}{\partial x} f(x, u, t) \right],
\]
which is the Hamilton–Jacobi–Bellman equation.
Stochastic System. In this scenario $V(x, t)$ is the expected cost-to-go. We have a controlled Markov chain in which the transition probabilities are given by $p_{ij}(u)$, $u \in \mathcal{U}$; that is, the control can alter the transition probabilities. Then we have
\[
V(x, t) = \min_{u} \Big[ c(x, u, t) + \sum_{j \in \mathcal{X}} p_{xj}(u) V(j, t+1) \Big].
\]
The expected cost is given by
\[
\mathbb{E}\left[ \sum_{t=0}^{T-1} c\big(x(t), u(t), t\big) + d\big(x(T)\big) \,\Big|\, x(0) = x_0 \right].
\]
But how do we choose the actions $(u(0), u(1), \ldots, u(T-1))$? We need a policy.
Policy
• History-dependent policy: choose the action based on the history $H_t = \big(x(0), u(0), \ldots, x(t-1), u(t-1), x(t)\big)$, i.e., $u(t) = g_t(H_t)$, where the policy is $g = (g_0, \ldots, g_{T-1})$.
• State-dependent/Markov policy: $u(t) = g_t\big(x(t)\big)$.
• Randomized policy: choose a probability distribution over actions instead of a single action.
Theorem 1.1. Define $\gamma^* = (\gamma^*_0, \ldots, \gamma^*_{T-1})$, where $\gamma^*_t : \mathcal{X} \to \mathcal{U}$ is the state-dependent policy such that $\gamma^*_t(x)$ minimizes $\big[ c(x, u, t) + \sum_{j \in \mathcal{X}} p_{xj}(u) V(j, t+1) \big]$ over $u$ for each $(x, t)$. Then the expected cost
\[
V(x, t) = \mathbb{E}^{\gamma^*}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, x(t) = x \right]
\]
is optimal in the class of all history-dependent policies $\gamma$, for all $x$ and $t$.
Proof. Let $\gamma = (\gamma_0, \ldots, \gamma_{T-1})$ denote any history-dependent policy. Its expected cost-to-go is
\[
V^{\gamma}_t = \mathbb{E}^{\gamma}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_t \right].
\]
It is easy to show that $V^{\gamma}_T \ge V(x(T), T)$ holds (indeed with equality, since both equal $d(x(T))$). Assume that $V^{\gamma}_s \ge V(x(s), s)$ for $s = t, t+1, \ldots, T$ and consider
\begin{align*}
V^{\gamma}_{t-1} &= \mathbb{E}^{\gamma}\left[ \sum_{n=t-1}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_{t-1} \right] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\left[ \mathbb{E}^{\gamma}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_t \right] \Big|\, H_{t-1} \right] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\big[ V^{\gamma}_t \,\big|\, H_{t-1} \big].
\end{align*}
Given the induction hypothesis, we have
\begin{align*}
V^{\gamma}_{t-1} &\ge c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\big[ V(x(t), t) \,\big|\, H_{t-1} \big] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \sum_{j} p_{x(t-1),j}\big(u(t-1)\big) V(j, t) \\
&\ge \min_{u} \Big[ c\big(x(t-1), u, t-1\big) + \sum_{j} p_{x(t-1),j}(u) V(j, t) \Big] \\
&= V\big(x(t-1), t-1\big).
\end{align*}
The induction is complete. Moreover, under $\gamma^*$ every inequality above holds with equality, so $V^{\gamma^*}_t = V(x(t), t)$. This shows that (1) $\gamma^*$ is optimal, (2) dynamic programming gives the optimal cost-to-go, and (3) a Markov (state-dependent) policy is optimal.
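Before moving on, here is a short Python sketch of this finite-horizon recursion for a controlled Markov chain, recording the minimizing action at each $(x, t)$, i.e., the Markov policy whose optimality the theorem asserts. The two-state, two-action chain and its costs are invented purely for illustration.
\begin{verbatim}
import numpy as np

# Finite-horizon DP for a controlled Markov chain:
#   V(x, T) = d(x),  V(x, t) = min_u [ c(x, u) + sum_j p_xj(u) V(j, t+1) ].
# Transition matrices and costs below are invented for illustration.
T = 4
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1],   # P[u][x][j] = p_xj(u)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
c = np.array([[1.0, 4.0],    # c[x][u], taken time-invariant for simplicity
              [2.0, 0.5]])
d = np.array([0.0, 3.0])     # terminal cost d(x)

V = d.copy()
policy = np.zeros((T, n_states), dtype=int)
for t in range(T - 1, -1, -1):
    # Q[x, u] = c(x, u) + sum_j p_xj(u) V(j)
    Q = c + np.stack([P[u] @ V for u in range(n_actions)], axis=1)
    policy[t] = Q.argmin(axis=1)   # minimizing action defines g_t(x)
    V = Q.min(axis=1)

print("V(x, 0):", V)
print("Markov policy g_t(x):", policy)
\end{verbatim}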
2 Infinite Horizon Dynamic Programming
Cost Criteria. The total cost criterion
\[
\mathbb{E}\left[ \sum_{t=0}^{+\infty} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right]
\]
might not be finite (e.g., $c \equiv 1$) or might not even be well defined (e.g., a time-varying cost $c = (-1)^t$). Therefore, we turn to the discounted cost criterion, in which $0 < \beta < 1$ and the cost is
\[
\mathbb{E}\left[ \sum_{t=0}^{+\infty} \beta^t c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right].
\]
A smaller $\beta$ leads to more myopic behavior. One can alternatively use the average cost criterion, which cares only about the asymptotic limit of the cost:
\[
\lim_{T \to +\infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right].
\]
2.1 Dynamic Programming with Discounted Cost
Let
\[
V_N(x) = \min_{g} \mathbb{E}^{g}\left[ \sum_{t=0}^{N-1} \beta^t c\big(x(t), u(t)\big) \,\Big|\, x(0) = x \right],
\]
where $u(t) = g(x(t))$. Then we can write the dynamic programming equation with the boundary condition $V_0(x) = 0$:
\[
V_N(x) = \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_{N-1}(j) \Big].
\]
Letting $N \to +\infty$, for all $x$ we have
\[
V(x) = \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V(j) \Big].
\]
For each $x$, let the minimizing $u$ be $g^*(x)$, where $g^* : \mathcal{X} \to \mathcal{U}$. Several questions need to be answered: Does a solution of the dynamic programming equation exist? How many solutions are there? Is the solution optimal? How can we compute it?
Definition 2.1. Let $F$ be a closed set and suppose that there is a norm $\|\cdot\|$ on $F$. A contraction mapping $T : F \to F$ satisfies $\|T(x) - T(y)\| \le \beta \|x - y\|$ for all $x, y \in F$, where $0 < \beta < 1$. A point $w$ such that $T(w) = w$ is called a fixed point of $T$.
Theorem 2.2 (Contraction Mapping Principle). Let $F$ be a complete normed vector space (Banach space) and let $T : F \to F$ be a contraction mapping. Then
1. there exists a unique fixed point $w$ such that $T(w) = w$;
2. for any $z \in F$, $\lim_{n \to +\infty} T^{(n)}(z) = w$.
Proof. Take any $z \in F$. Then, for any integers $m$ and $n$ with $0 < m < n$, we have
\[
\big\|T^{m}(z) - T^{n}(z)\big\| \le \sum_{i=m}^{n-1} \big\|T^{i}(z) - T^{i+1}(z)\big\| \le \frac{\beta^{m}}{1 - \beta} \big\|T(z) - z\big\|,
\]
so $\{T^{n}(z)\}$ is a Cauchy sequence and, since $F$ is complete, it converges to a limit $w$, which is the (unique) fixed point.
In the dynamic programming equation, the value function plays the role of the point $z$ and the right-hand side defines the operator $T$; we need to prove that this $T$ is a contraction mapping. Take two value functions $V_w$ and $V_z$. Then we have
\begin{align*}
T(V_w)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_w(j) \Big], \\
T(V_z)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_z(j) \Big].
\end{align*}
Suppose that $\bar{u}$ achieves the minimum in the second equation. Then we have
\begin{align*}
T(V_w)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_w(j) \Big] \le c(x, \bar{u}) + \beta \sum_{j \in \mathcal{X}} p_{xj}(\bar{u}) V_w(j), \\
T(V_w)(x) - T(V_z)(x) &\le \beta \sum_{j \in \mathcal{X}} p_{xj}(\bar{u}) \big( V_w(j) - V_z(j) \big).
\end{align*}
By symmetry the same bound holds with $V_w$ and $V_z$ exchanged. Choosing the infinity (sup) norm, we have
\[
\big\|T(V_w) - T(V_z)\big\|_{\infty} \le \beta \big\|V_w - V_z\big\|_{\infty}.
\]
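As a quick numerical sanity check of this contraction property, the sketch below builds a random discounted problem (the sizes, seed, and array layout are arbitrary choices, not from the lecture) and verifies $\|T(V_1) - T(V_2)\|_\infty \le \beta \|V_1 - V_2\|_\infty$ for two random value vectors.
\begin{verbatim}
import numpy as np

# Numerical check that the DP operator T is a beta-contraction in the sup norm.
# The random problem below (3 states, 2 actions) is an arbitrary illustration.
rng = np.random.default_rng(0)
n, m, beta = 3, 2, 0.9
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)               # P[u][x][j] = p_xj(u)
c = rng.random((n, m))                          # c[x][u]

def T(V):
    # (T V)(x) = min_u [ c(x, u) + beta * sum_j p_xj(u) V(j) ]
    Q = c + beta * np.stack([P[u] @ V for u in range(m)], axis=1)
    return Q.min(axis=1)

V1, V2 = rng.random(n), rng.random(n)
lhs = np.abs(T(V1) - T(V2)).max()
rhs = beta * np.abs(V1 - V2).max()
print(lhs, "<=", rhs, lhs <= rhs + 1e-12)
\end{verbatim}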
How many stationary policies are there? There are $|\mathcal{U}|^{|\mathcal{X}|}$ of them. How do we search for the optimal policy?
Value Iteration. By the contraction mapping principle, repeatedly applying the dynamic programming operator to any initial guess converges to the unique fixed point $V$, from which the optimal actions can be read off.
Algorithm 1 Value Iteration
Require: state set $\mathcal{X}$, action set $\mathcal{U}$, cost function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, transition probabilities $p_{xj}(u)$, and $\beta \in (0, 1)$
procedure ValueIteration($\mathcal{X}$, $\mathcal{U}$, $c$, $\beta$)
    Initialize the value function $V$ arbitrarily
    while $V$ has not converged do
        $V' \leftarrow V$
        for $x \in \mathcal{X}$ do
            $V(x) \leftarrow \min_{u} \big\{ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V'(j) \big\}$
    return $V$
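A minimal Python sketch of Algorithm 1, assuming the transition probabilities are stored as an array P[u][x][j] $= p_{xj}(u)$ and the costs as c[x][u]; these conventions and the stopping tolerance are my own choices.
\begin{verbatim}
import numpy as np

def value_iteration(P, c, beta, tol=1e-8):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost, 0 < beta < 1."""
    n_states, n_actions = c.shape
    V = np.zeros(n_states)                     # arbitrary initialization
    while True:
        # Apply the DP operator: (TV)(x) = min_u [ c(x,u) + beta * sum_j p_xj(u) V(j) ]
        Q = c + beta * np.stack([P[u] @ V for u in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)
        if np.abs(V_new - V).max() < tol:      # sup-norm stopping test
            return V_new, Q.argmin(axis=1)     # value function and a greedy policy
        V = V_new
\end{verbatim}
Because the operator is a $\beta$-contraction in the sup norm, the error shrinks by a factor of at least $\beta$ per sweep, so the stopping test is eventually met.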
Policy Iteration. Given any stationary policy $\pi : \mathcal{X} \to \mathcal{U}$, one can easily solve for the value function of $\pi$, $V^{\pi} = [V^{\pi}(1), \ldots, V^{\pi}(n)]^{\top}$ with $n = |\mathcal{X}|$, through a linear equation:
\[
V^{\pi} = (I - \beta P)^{-1} C,
\]
where $P$ is the transition probability matrix under $\pi$ and $C = [c(1, \pi(1)), \ldots, c(n, \pi(n))]^{\top}$.
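With the same array conventions as in the value-iteration sketch, this evaluation step is a single linear solve (an illustrative sketch, not part of the lecture).
\begin{verbatim}
import numpy as np

# V_pi = (I - beta * P_pi)^{-1} C_pi, where
#   P_pi[x, j] = p_xj(pi(x))  and  C_pi[x] = c(x, pi(x)).
def evaluate_policy(P_pi, C_pi, beta):
    n = len(C_pi)
    return np.linalg.solve(np.eye(n) - beta * P_pi, C_pi)
\end{verbatim}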
Now choose a new policy (the greedy policy)
\[
\pi'(x) = \operatorname*{argmin}_{u \in \mathcal{U}} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi}(j) \Big].
\]
Hence, we have
\[
\min_{u \in \mathcal{U}} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi}(j) \Big] \le c\big(x, \pi(x)\big) + \beta \sum_{j \in \mathcal{X}} p_{xj}\big(\pi(x)\big) V^{\pi}(j) = V^{\pi}(x)
\]
and
\begin{align*}
V^{\pi}(x_t) &= \mathbb{E}^{\pi}\big[ c(x_t, u_t) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&\ge \mathbb{E}\big[ c\big(x_t, \pi'(x_t)\big) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&= \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&\ge \mathbb{E}^{\pi'}\Big[ c(x_t, u_t) + \beta\, \mathbb{E}\big[ c\big(x_{t+1}, \pi'(x_{t+1})\big) + \beta V^{\pi}(x_{t+2}) \,\big|\, x_{t+1} \big] \,\Big|\, x_t \Big] \\
&= \mathbb{E}^{\pi'}\Big[ c(x_t, u_t) + \beta\, \mathbb{E}^{\pi'}\big[ c(x_{t+1}, u_{t+1}) + \beta V^{\pi}(x_{t+2}) \,\big|\, x_{t+1} \big] \,\Big|\, x_t \Big] \\
&= \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta c(x_{t+1}, u_{t+1}) + \beta^2 V^{\pi}(x_{t+2}) \,\big|\, x_t \big] \\
&\;\;\vdots \\
&\ge \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta c(x_{t+1}, u_{t+1}) + \beta^2 c(x_{t+2}, u_{t+2}) + \cdots \,\big|\, x_t \big] \\
&= V^{\pi'}(x_t).
\end{align*}
Hence $V^{\pi'}(x) \le V^{\pi}(x)$ for all $x$. So in the algorithm below, we use this monotone improvement to search for the optimal policy: the values can only decrease, and since there are finitely many stationary policies, the iteration terminates at an optimum.
Algorithm 2 Policy Iteration
Require: state set $\mathcal{X}$, action set $\mathcal{U}$, cost function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, transition probabilities $p_{xj}(u)$, and $\beta \in (0, 1)$
procedure PolicyIteration($\mathcal{X}$, $\mathcal{U}$, $c$, $\beta$)
    Initialize the policy $\pi$ arbitrarily
    while $\pi$ has not converged do
        $\pi' \leftarrow \pi$
        Form the transition probability matrix $P$ and cost vector $C$ under $\pi'$
        $V^{\pi'} \leftarrow (I - \beta P)^{-1} C$
        for $x \in \mathcal{X}$ do
            $\pi(x) \leftarrow \operatorname*{argmin}_{u} \big\{ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi'}(j) \big\}$
    return $\pi$
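A Python sketch of Algorithm 2 under the same array conventions, with exact policy evaluation via the linear solve $(I - \beta P_{\pi})^{-1} C_{\pi}$; the layout is an assumption of the sketch, not the lecture's notation.
\begin{verbatim}
import numpy as np

def policy_iteration(P, c, beta):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost, 0 < beta < 1."""
    n_states, n_actions = c.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: V_pi = (I - beta * P_pi)^{-1} C_pi
        P_pi = P[pi, np.arange(n_states), :]    # row x is p_x.(pi(x))
        C_pi = c[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - beta * P_pi, C_pi)
        # Policy improvement: greedy with respect to V
        Q = c + beta * np.stack([P[u] @ V for u in range(n_actions)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):          # no change: pi is optimal
            return pi, V
        pi = pi_new
\end{verbatim}
Since each improvement step can only lower the value function (as shown above) and there are only finitely many stationary policies, the loop terminates with an optimal policy.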
2.2 Dynamic Programming with Average Cost
Consider the following value function:
\[
V_N(x) = \min_{u} \mathbb{E}\left[ \sum_{t=0}^{N-1} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x \right].
\]
We can derive the dynamic programming equation
\[
V_N(x) = \min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) V_{N-1}(j) \Big].
\]
Then, subtracting $V_N(x)$ from both sides (using $\sum_{j} p_{xj}(u) = 1$), dividing by $N$, and adding back $\frac{V_N(x)}{N}$, we have
\[
\frac{V_N(x)}{N} = \min_{u} \frac{1}{N}\Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) \big( V_{N-1}(j) - V_N(x) \big) \Big] + \frac{V_N(x)}{N}.
\]
Assume that
\[
\lim_{N \to +\infty} \big( V_N(x) - N J^* \big) = w(x), \quad \forall x,
\]
where $J^*$ is the long-term average cost. Then $V_{N-1}(j) - V_N(x) \to w(j) - w(x) - J^*$, and in the limit the dynamic programming equation becomes
\[
J^* + w(x) = \min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w(j) \Big].
\]
There are $|\mathcal{X}|$ equations and $|\mathcal{X}| + 1$ unknowns, so we need to fix one unknown; for example, we let $w(1) = 0$.
Theorem 2.3. Suppose $\{J^*, w(1), \ldots, w(n)\}$ solves the dynamic programming equation for the long-term average cost problem, and let $\pi^*$ denote the corresponding policy that attains the "min" for each $x$. Then $\pi^*$ is an optimal policy, and it is stationary.
Proof. Let $\pi$ be any history-dependent policy, generating actions $u(t)$. Since $J^* + w(x) \le c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w(j)$ for every $u$, at each time $t$ we have
\[
c\big(x(t), u(t)\big) \ge J^* + w\big(x(t)\big) - \sum_{j \in \mathcal{X}} p_{x(t)j}\big(u(t)\big) w(j).
\]
Averaging over $t = 0, \ldots, N-1$ and taking expectations (note that $\sum_{j} p_{x(t)j}(u(t)) w(j) = \mathbb{E}[w(x(t+1)) \mid x(t), u(t)]$), we get
\[
\frac{1}{N} \mathbb{E}\left[ \sum_{t=0}^{N-1} c\big(x(t), u(t)\big) \right] \ge \frac{1}{N} \sum_{t=0}^{N-1} J^* + \frac{1}{N} \sum_{t=0}^{N-1} \mathbb{E}\big[ w\big(x(t)\big) - w\big(x(t+1)\big) \big].
\]
The last sum telescopes, so
\[
\frac{1}{N} V^{\pi}_N(x) \ge J^* + \frac{w\big(x(0)\big) - \mathbb{E}\big[ w\big(x(N)\big) \big]}{N}.
\]
When $N$ goes to infinity, the last term vanishes ($w$ is bounded on the finite state space) and we have
\[
\liminf_{N \to +\infty} \frac{1}{N} V^{\pi}_N(x) \ge J^*.
\]
Under $\pi^*$ the inequality holds with equality at every step, so $\pi^*$ attains the average cost $J^*$ and is therefore optimal.
Policy Iteration Algorithm. Let $\pi$ be any stationary policy. We solve
\[
J^{\pi} + w^{\pi}(x) = c\big(x, \pi(x)\big) + \sum_{j \in \mathcal{X}} p_{xj}\big(\pi(x)\big) w^{\pi}(j)
\]
to get $\{J^{\pi}, w^{\pi}(1), \ldots, w^{\pi}(n)\}$ (again fixing one unknown, say $w^{\pi}(1) = 0$). Then let $u = \pi'(x)$ solve
\[
\min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w^{\pi}(j) \Big].
\]
Since $\pi'(x)$ attains the minimum, $J^{\pi} + w^{\pi}(x) \ge c\big(x, \pi'(x)\big) + \sum_{j \in \mathcal{X}} p_{xj}\big(\pi'(x)\big) w^{\pi}(j)$. Let $\sigma$ be the invariant probability distribution of the chain under $\pi'$. Multiplying by $\sigma(x)$ and summing over $x$,
\[
\sum_{x \in \mathcal{X}} \sigma(x) J^{\pi} + \sum_{x \in \mathcal{X}} \sigma(x) w^{\pi}(x) \ge \sum_{x \in \mathcal{X}} \sigma(x) c\big(x, \pi'(x)\big) + \sum_{x \in \mathcal{X}} \sigma(x) \sum_{j \in \mathcal{X}} p_{xj}\big(\pi'(x)\big) w^{\pi}(j).
\]
Because $\sigma$ is invariant under $\pi'$, the last term equals $\sum_{j \in \mathcal{X}} \sigma(j) w^{\pi}(j)$, which cancels the second term on the left, leaving
\[
J^{\pi} \ge \sum_{x \in \mathcal{X}} \sigma(x) c\big(x, \pi'(x)\big) = J^{\pi'}.
\]
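Finally, a Python sketch of this average-cost policy iteration. The evaluation step solves the linear system $J^{\pi} + w^{\pi}(x) = c(x, \pi(x)) + \sum_j p_{xj}(\pi(x)) w^{\pi}(j)$ with $w^{\pi}$ fixed to zero at the first state; it assumes the chain under every policy has a single recurrent class, so that $J^{\pi}$ does not depend on the initial state. The array layout matches the earlier sketches and is my own convention.
\begin{verbatim}
import numpy as np

def evaluate_average_cost(P_pi, c_pi):
    """Solve J + w(x) = c_pi(x) + sum_j P_pi[x, j] w(j) with w(0) = 0.
    Unknowns: (J, w(1), ..., w(n-1)); assumes a single recurrent class."""
    n = len(c_pi)
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                # coefficient of J
    A[:, 1:] = np.eye(n)[:, 1:] - P_pi[:, 1:]    # coefficients of w(1..n-1)
    z = np.linalg.solve(A, c_pi)
    J, w = z[0], np.concatenate(([0.0], z[1:]))
    return J, w

def average_cost_policy_iteration(P, c):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost."""
    n_states, n_actions = c.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[pi, np.arange(n_states), :]     # transitions under pi
        c_pi = c[np.arange(n_states), pi]        # costs under pi
        J, w = evaluate_average_cost(P_pi, c_pi)
        # Improvement: minimize c(x, u) + sum_j p_xj(u) w(j)
        Q = c + np.stack([P[u] @ w for u in range(n_actions)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, J, w
        pi = pi_new
\end{verbatim}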