Dynamic Programming
Zijing Hu
November 11, 2022
Contents
1 Finite Horizon Dynamic Programming
2 Infinite Horizon Dynamic Programming
  2.1 Dynamic Programming with Discounted Cost
  2.2 Dynamic Programming with Average Cost

*This note is based on ECEN 755: Stochastic Systems by Dr. P. R. Kumar, TAMU.
1 Finite Horizon Dynamic Programming
Dynamic Programming
• It solves the problem for all initial states (computationally expensive).
• It uses backward recursion.
• It relies on the principle of optimality: segments of an optimal path are themselves optimal.
• The optimal solution is given in feedback/closed-loop form, i.e., as a map from states to actions.
Notation
• $x(t) \in \mathcal{X}$: state at time $t$, where $\mathcal{X}$ is the state space
• $u(t) \in \mathcal{U}$: action taken at time $t$, where $\mathcal{U}$ is the action/control set
• $T$: finite time horizon
• $c(x, u, t)$: cost of taking action $u$ in state $x$ at time $t$
• $d(x)$: cost of being in state $x$ at the end of the horizon (terminal cost)
• $f(x, u, t)$: the state you will be in at time $t+1$ if you take action $u$ in state $x$ at time $t$
Discrete Time Dynamic System. The target problem is
\[
\min_{u} \left[ d\big(x(T)\big) + \sum_{t=0}^{T-1} c\big(x(t), u(t), t\big) \right].
\]
Define $V(x, t)$ as the minimum cost-to-go from state $x$ at time $t$ until the end of the horizon. Then we have
\begin{align*}
V(x, T) &= d(x), \\
V(x, t-1) &= \min_{u} \big[ c(x, u, t-1) + V\big(f(x, u, t-1), t\big) \big].
\end{align*}
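To make the backward recursion concrete, here is a small self-contained Python sketch on a toy deterministic problem. The specific cost $c$, terminal cost $d$, and dynamics $f$ below are made-up placeholders, not taken from the lecture.
\begin{verbatim}
# Backward recursion for a finite-horizon deterministic problem:
#   V(x, T) = d(x),   V(x, t) = min_u [ c(x, u, t) + V(f(x, u, t), t+1) ].
# The cost, terminal cost, and dynamics below are illustrative placeholders.
T = 5
states = range(4)
actions = range(2)

def c(x, u, t):                     # stage cost (placeholder)
    return (x - 2) ** 2 + u

def d(x):                           # terminal cost (placeholder)
    return abs(x - 2)

def f(x, u, t):                     # deterministic dynamics (placeholder)
    return min(max(x + (1 if u == 1 else -1), 0), 3)

V = {(x, T): d(x) for x in states}  # boundary condition V(x, T) = d(x)
policy = {}
for t in range(T - 1, -1, -1):      # backward in time
    for x in states:
        best_u, best_val = min(
            ((u, c(x, u, t) + V[(f(x, u, t), t + 1)]) for u in actions),
            key=lambda p: p[1],
        )
        V[(x, t)] = best_val
        policy[(x, t)] = best_u     # feedback (closed-loop) policy
\end{verbatim}
The minimizing action is stored for every $(x, t)$, so the result is exactly the feedback form described above.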
Continuous Time Dynamic System. Given the dynamics $\dot{x}(t) = f(x(t), u(t), t)$, the target problem becomes
\[
\min_{u} \left[ d\big(x(T)\big) + \int_{0}^{T} c\big(x(t), u(t), t\big)\, dt \right].
\]
Define $V(x, t)$ similarly as above. For a small step size $h > 0$, we have
\[
V(x, t) = \min_{u} \big[ c(x, u, t)h + V\big(x + f(x, u, t)h,\; t + h\big) \big]
\]
\[
V(x, t) \approx \min_{u} \left[ c(x, u, t)h + V(x, t) + \frac{\partial V(x, t)}{\partial x} f(x, u, t)h + \frac{\partial V(x, t)}{\partial t} h \right]
\]
\[
-\frac{\partial V(x, t)}{\partial t} = \min_{u} \left[ c(x, u, t) + \frac{\partial V(x, t)}{\partial x} f(x, u, t) \right],
\]
which is the Hamilton–Jacobi–Bellman equation.
Stochastic System. In this scenario $V(x, t)$ is the expected cost-to-go. We have a controlled Markov chain in which the transition probabilities are given by $p_{ij}(u)$, $u \in \mathcal{U}$; that is, the control can alter the transition probabilities. Then we have
\[
V(x, t) = \min_{u} \Big[ c(x, u, t) + \sum_{j \in \mathcal{X}} p_{xj}(u) V(j, t+1) \Big].
\]
The expected cost is given by
\[
\mathbb{E}\left[ \sum_{t=0}^{T-1} c\big(x(t), u(t), t\big) + d\big(x(T)\big) \,\Big|\, x(0) = x_0 \right].
\]
But how do we choose the actions $(u(0), u(1), \ldots, u(T-1))$? We need a policy.
Policy
• History-dependent policy: choose the action based on the history $H_t = \big(x(0), u(0), \ldots, x(t-1), u(t-1), x(t)\big)$, i.e., $u(t) = g_t(H_t)$, where the policy is $g = (g_0, \ldots, g_{T-1})$.
• State-dependent/Markov policy: $u(t) = g_t\big(x(t)\big)$.
• Randomized policy: choose a probability distribution over actions instead of a single action.
Theorem 1.1. Define $\gamma^* = (\gamma^*_0, \ldots, \gamma^*_{T-1})$, where $\gamma^*_t : \mathcal{X} \to \mathcal{U}$ is the state-dependent policy such that $\gamma^*_t(x)$ minimizes $\big[ c(x, u, t) + \sum_{j \in \mathcal{X}} p_{xj}(u) V(j, t+1) \big]$ over $u$ for each $(x, t)$. Then the expected cost
\[
V(x, t) = \mathbb{E}^{\gamma^*}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, x(t) = x \right]
\]
is optimal in the class of all history-dependent policies $\gamma$, for all $x$ and $t$.
Proof. Let $\gamma = (\gamma_0, \ldots, \gamma_{T-1})$ denote any history-dependent policy. Its expected cost-to-go is
\[
V^{\gamma}_t = \mathbb{E}^{\gamma}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_t \right].
\]
It is easy to show that $V^{\gamma}_T \ge V(x(T), T)$ holds (indeed with equality, since both equal $d(x(T))$). Assume that $V^{\gamma}_s \ge V(x(s), s)$ for $s = t, t+1, \ldots, T$ and consider
\begin{align*}
V^{\gamma}_{t-1} &= \mathbb{E}^{\gamma}\left[ \sum_{n=t-1}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_{t-1} \right] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\left[ \mathbb{E}^{\gamma}\left[ \sum_{n=t}^{T-1} c\big(x(n), u(n), n\big) + d\big(x(T)\big) \,\Big|\, H_t \right] \Big|\, H_{t-1} \right] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\big[ V^{\gamma}_t \,\big|\, H_{t-1} \big].
\end{align*}
Given the induction hypothesis, we have
\begin{align*}
V^{\gamma}_{t-1} &\ge c\big(x(t-1), u(t-1), t-1\big) + \mathbb{E}^{\gamma}\big[ V(x(t), t) \,\big|\, H_{t-1} \big] \\
&= c\big(x(t-1), u(t-1), t-1\big) + \sum_{j} p_{x(t-1),j}\big(u(t-1)\big) V(j, t) \\
&\ge \min_{u} \Big[ c\big(x(t-1), u, t-1\big) + \sum_{j} p_{x(t-1),j}(u) V(j, t) \Big] \\
&= V\big(x(t-1), t-1\big).
\end{align*}
The induction is complete. Moreover, under $\gamma^*$ every inequality above holds with equality, so $V^{\gamma^*}_t = V(x(t), t)$. This shows that (1) $\gamma^*$ is optimal, (2) dynamic programming gives the optimal cost-to-go, and (3) a Markov (state-dependent) policy is optimal.
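Before moving on, here is a short Python sketch of this finite-horizon recursion for a controlled Markov chain, recording the minimizing action at each $(x, t)$, i.e., the Markov policy whose optimality the theorem asserts. The two-state, two-action chain and its costs are invented purely for illustration.
\begin{verbatim}
import numpy as np

# Finite-horizon DP for a controlled Markov chain:
#   V(x, T) = d(x),  V(x, t) = min_u [ c(x, u) + sum_j p_xj(u) V(j, t+1) ].
# Transition matrices and costs below are invented for illustration.
T = 4
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1],   # P[u][x][j] = p_xj(u)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
c = np.array([[1.0, 4.0],    # c[x][u], taken time-invariant for simplicity
              [2.0, 0.5]])
d = np.array([0.0, 3.0])     # terminal cost d(x)

V = d.copy()
policy = np.zeros((T, n_states), dtype=int)
for t in range(T - 1, -1, -1):
    # Q[x, u] = c(x, u) + sum_j p_xj(u) V(j)
    Q = c + np.stack([P[u] @ V for u in range(n_actions)], axis=1)
    policy[t] = Q.argmin(axis=1)   # minimizing action defines g_t(x)
    V = Q.min(axis=1)

print("V(x, 0):", V)
print("Markov policy g_t(x):", policy)
\end{verbatim}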
2 Infinite Horizon Dynamic Programming
Cost Criteria. The total cost criterion
\[
\mathbb{E}\left[ \sum_{t=0}^{+\infty} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right]
\]
might not be finite (e.g., $c \equiv 1$) or might not even be well defined (e.g., a time-varying cost $c = (-1)^t$). Therefore, we turn to the discounted cost criterion, in which $0 < \beta < 1$ and the cost is
\[
\mathbb{E}\left[ \sum_{t=0}^{+\infty} \beta^t c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right].
\]
A smaller $\beta$ leads to more myopic behavior. One can alternatively use the average cost criterion, which cares only about the asymptotic limit of the cost:
\[
\lim_{T \to +\infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x_0 \right].
\]
2.1 Dynamic Programming with Discounted Cost
Let
\[
V_N(x) = \min_{g} \mathbb{E}^{g}\left[ \sum_{t=0}^{N-1} \beta^t c\big(x(t), u(t)\big) \,\Big|\, x(0) = x \right],
\]
where $u(t) = g(x(t))$. Then we can write the dynamic programming equation with the boundary condition $V_0(x) = 0$:
\[
V_N(x) = \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_{N-1}(j) \Big].
\]
Letting $N \to +\infty$, for all $x$ we have
\[
V(x) = \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V(j) \Big].
\]
For each $x$, let the minimizing $u$ be $g^*(x)$, where $g^* : \mathcal{X} \to \mathcal{U}$. Several questions need to be answered: Does a solution of the dynamic programming equation exist? How many solutions are there? Is the solution optimal? How can we compute it?
Definition 2.1. Let $F$ be a closed set and suppose that there is a norm $\|\cdot\|$ on $F$. A contraction mapping $T : F \to F$ satisfies $\|T(x) - T(y)\| \le \beta \|x - y\|$ for all $x, y \in F$, where $0 < \beta < 1$. A point $w$ such that $T(w) = w$ is called a fixed point of $T$.
Theorem 2.2 (Contraction Mapping Principle). Let $F$ be a complete normed vector space (Banach space) and let $T : F \to F$ be a contraction mapping. Then
1. there exists a unique fixed point $w$ such that $T(w) = w$;
2. for any $z \in F$, $\lim_{n \to +\infty} T^{(n)}(z) = w$.
Proof. Take any $z \in F$. Then, for any integers $m$ and $n$ with $0 < m < n$, we have
\[
\big\|T^{m}(z) - T^{n}(z)\big\| \le \sum_{i=m}^{n-1} \big\|T^{i}(z) - T^{i+1}(z)\big\| \le \frac{\beta^{m}}{1 - \beta} \big\|T(z) - z\big\|,
\]
so $\{T^{n}(z)\}$ is a Cauchy sequence and, since $F$ is complete, it converges to a limit $w$, which is the (unique) fixed point.
In the dynamic programming equation, the value function plays the role of the point $z$ and the right-hand side defines the operator $T$; we need to prove that this $T$ is a contraction mapping. Take two value functions $V_w$ and $V_z$. Then we have
\begin{align*}
T(V_w)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_w(j) \Big], \\
T(V_z)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_z(j) \Big].
\end{align*}
Suppose that $\bar{u}$ achieves the minimum in the second equation. Then we have
\begin{align*}
T(V_w)(x) &= \min_{u} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V_w(j) \Big] \le c(x, \bar{u}) + \beta \sum_{j \in \mathcal{X}} p_{xj}(\bar{u}) V_w(j), \\
T(V_w)(x) - T(V_z)(x) &\le \beta \sum_{j \in \mathcal{X}} p_{xj}(\bar{u}) \big( V_w(j) - V_z(j) \big).
\end{align*}
By symmetry the same bound holds with $V_w$ and $V_z$ exchanged. Choosing the infinity (sup) norm, we have
\[
\big\|T(V_w) - T(V_z)\big\|_{\infty} \le \beta \big\|V_w - V_z\big\|_{\infty}.
\]
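As a quick numerical sanity check of this contraction property, the sketch below builds a random discounted problem (the sizes, seed, and array layout are arbitrary choices, not from the lecture) and verifies $\|T(V_1) - T(V_2)\|_\infty \le \beta \|V_1 - V_2\|_\infty$ for two random value vectors.
\begin{verbatim}
import numpy as np

# Numerical check that the DP operator T is a beta-contraction in the sup norm.
# The random problem below (3 states, 2 actions) is an arbitrary illustration.
rng = np.random.default_rng(0)
n, m, beta = 3, 2, 0.9
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)               # P[u][x][j] = p_xj(u)
c = rng.random((n, m))                          # c[x][u]

def T(V):
    # (T V)(x) = min_u [ c(x, u) + beta * sum_j p_xj(u) V(j) ]
    Q = c + beta * np.stack([P[u] @ V for u in range(m)], axis=1)
    return Q.min(axis=1)

V1, V2 = rng.random(n), rng.random(n)
lhs = np.abs(T(V1) - T(V2)).max()
rhs = beta * np.abs(V1 - V2).max()
print(lhs, "<=", rhs, lhs <= rhs + 1e-12)
\end{verbatim}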
How many stationary policies are there? There are $|\mathcal{U}|^{|\mathcal{X}|}$ of them. How do we search for the optimal policy?
Value Iteration. By the contraction mapping principle, repeatedly applying the dynamic programming operator to any initial guess converges to the unique fixed point $V$, from which the optimal actions can be read off.
Algorithm 1 Value Iteration
Require: state set $\mathcal{X}$, action set $\mathcal{U}$, cost function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, transition probabilities $p_{xj}(u)$, and $\beta \in (0, 1)$
procedure ValueIteration($\mathcal{X}$, $\mathcal{U}$, $c$, $\beta$)
    Initialize the value function $V$ arbitrarily
    while $V$ has not converged do
        $V' \leftarrow V$
        for $x \in \mathcal{X}$ do
            $V(x) \leftarrow \min_{u} \big\{ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V'(j) \big\}$
    return $V$
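A minimal Python sketch of Algorithm 1, assuming the transition probabilities are stored as an array P[u][x][j] $= p_{xj}(u)$ and the costs as c[x][u]; these conventions and the stopping tolerance are my own choices.
\begin{verbatim}
import numpy as np

def value_iteration(P, c, beta, tol=1e-8):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost, 0 < beta < 1."""
    n_states, n_actions = c.shape
    V = np.zeros(n_states)                     # arbitrary initialization
    while True:
        # Apply the DP operator: (TV)(x) = min_u [ c(x,u) + beta * sum_j p_xj(u) V(j) ]
        Q = c + beta * np.stack([P[u] @ V for u in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)
        if np.abs(V_new - V).max() < tol:      # sup-norm stopping test
            return V_new, Q.argmin(axis=1)     # value function and a greedy policy
        V = V_new
\end{verbatim}
Because the operator is a $\beta$-contraction in the sup norm, the error shrinks by a factor of at least $\beta$ per sweep, so the stopping test is eventually met.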
Policy Iteration. Given any stationary policy $\pi : \mathcal{X} \to \mathcal{U}$, one can easily solve for the value function of $\pi$, $V^{\pi} = [V^{\pi}(1), \ldots, V^{\pi}(n)]^{\top}$ with $n = |\mathcal{X}|$, through a linear equation:
\[
V^{\pi} = (I - \beta P)^{-1} C,
\]
where $P$ is the transition probability matrix under $\pi$ and $C = [c(1, \pi(1)), \ldots, c(n, \pi(n))]^{\top}$.
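With the same array conventions as in the value-iteration sketch, this evaluation step is a single linear solve (an illustrative sketch, not part of the lecture).
\begin{verbatim}
import numpy as np

# V_pi = (I - beta * P_pi)^{-1} C_pi, where
#   P_pi[x, j] = p_xj(pi(x))  and  C_pi[x] = c(x, pi(x)).
def evaluate_policy(P_pi, C_pi, beta):
    n = len(C_pi)
    return np.linalg.solve(np.eye(n) - beta * P_pi, C_pi)
\end{verbatim}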
Now choose a new policy (the greedy policy)
\[
\pi'(x) = \operatorname*{argmin}_{u \in \mathcal{U}} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi}(j) \Big].
\]
Hence, we have
\[
\min_{u \in \mathcal{U}} \Big[ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi}(j) \Big] \le c\big(x, \pi(x)\big) + \beta \sum_{j \in \mathcal{X}} p_{xj}\big(\pi(x)\big) V^{\pi}(j) = V^{\pi}(x)
\]
and
\begin{align*}
V^{\pi}(x_t) &= \mathbb{E}^{\pi}\big[ c(x_t, u_t) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&\ge \mathbb{E}\big[ c\big(x_t, \pi'(x_t)\big) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&= \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta V^{\pi}(x_{t+1}) \,\big|\, x_t \big] \\
&\ge \mathbb{E}^{\pi'}\Big[ c(x_t, u_t) + \beta\, \mathbb{E}\big[ c\big(x_{t+1}, \pi'(x_{t+1})\big) + \beta V^{\pi}(x_{t+2}) \,\big|\, x_{t+1} \big] \,\Big|\, x_t \Big] \\
&= \mathbb{E}^{\pi'}\Big[ c(x_t, u_t) + \beta\, \mathbb{E}^{\pi'}\big[ c(x_{t+1}, u_{t+1}) + \beta V^{\pi}(x_{t+2}) \,\big|\, x_{t+1} \big] \,\Big|\, x_t \Big] \\
&= \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta c(x_{t+1}, u_{t+1}) + \beta^2 V^{\pi}(x_{t+2}) \,\big|\, x_t \big] \\
&\;\;\vdots \\
&\ge \mathbb{E}^{\pi'}\big[ c(x_t, u_t) + \beta c(x_{t+1}, u_{t+1}) + \beta^2 c(x_{t+2}, u_{t+2}) + \cdots \,\big|\, x_t \big] \\
&= V^{\pi'}(x_t).
\end{align*}
Hence $V^{\pi'}(x) \le V^{\pi}(x)$ for all $x$. So in the algorithm below, we use this monotone improvement to search for the optimal policy: the values can only decrease, and since there are finitely many stationary policies, the iteration terminates at an optimum.
Algorithm 2 Policy Iteration
Require: state set $\mathcal{X}$, action set $\mathcal{U}$, cost function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, transition probabilities $p_{xj}(u)$, and $\beta \in (0, 1)$
procedure PolicyIteration($\mathcal{X}$, $\mathcal{U}$, $c$, $\beta$)
    Initialize the policy $\pi$ arbitrarily
    while $\pi$ has not converged do
        $\pi' \leftarrow \pi$
        Form the transition probability matrix $P$ and cost vector $C$ under $\pi'$
        $V^{\pi'} \leftarrow (I - \beta P)^{-1} C$
        for $x \in \mathcal{X}$ do
            $\pi(x) \leftarrow \operatorname*{argmin}_{u} \big\{ c(x, u) + \beta \sum_{j \in \mathcal{X}} p_{xj}(u) V^{\pi'}(j) \big\}$
    return $\pi$
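A Python sketch of Algorithm 2 under the same array conventions, with exact policy evaluation via the linear solve $(I - \beta P_{\pi})^{-1} C_{\pi}$; the layout is an assumption of the sketch, not the lecture's notation.
\begin{verbatim}
import numpy as np

def policy_iteration(P, c, beta):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost, 0 < beta < 1."""
    n_states, n_actions = c.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: V_pi = (I - beta * P_pi)^{-1} C_pi
        P_pi = P[pi, np.arange(n_states), :]    # row x is p_x.(pi(x))
        C_pi = c[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - beta * P_pi, C_pi)
        # Policy improvement: greedy with respect to V
        Q = c + beta * np.stack([P[u] @ V for u in range(n_actions)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):          # no change: pi is optimal
            return pi, V
        pi = pi_new
\end{verbatim}
Since each improvement step can only lower the value function (as shown above) and there are only finitely many stationary policies, the loop terminates with an optimal policy.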
2.2 Dynamic Programming with Average Cost
Consider the following value function:
\[
V_N(x) = \min_{u} \mathbb{E}\left[ \sum_{t=0}^{N-1} c\big(x(t), u(t)\big) \,\Big|\, x(0) = x \right].
\]
We can derive the dynamic programming equation
\[
V_N(x) = \min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) V_{N-1}(j) \Big].
\]
Then, subtracting $V_N(x)$ from both sides (using $\sum_{j} p_{xj}(u) = 1$), dividing by $N$, and adding back $\frac{V_N(x)}{N}$, we have
\[
\frac{V_N(x)}{N} = \min_{u} \frac{1}{N}\Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) \big( V_{N-1}(j) - V_N(x) \big) \Big] + \frac{V_N(x)}{N}.
\]
Assume that
\[
\lim_{N \to +\infty} \big( V_N(x) - N J^* \big) = w(x), \quad \forall x,
\]
where $J^*$ is the long-term average cost. Then $V_{N-1}(j) - V_N(x) \to w(j) - w(x) - J^*$, and in the limit the dynamic programming equation becomes
\[
J^* + w(x) = \min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w(j) \Big].
\]
There are $|\mathcal{X}|$ equations and $|\mathcal{X}| + 1$ unknowns, so we need to fix one unknown; for example, we let $w(1) = 0$.
Theorem 2.3. Suppose $\{J^*, w(1), \ldots, w(n)\}$ solves the dynamic programming equation for the long-term average cost problem, and let $\pi^*$ denote the corresponding policy that attains the "min" for each $x$. Then $\pi^*$ is an optimal policy, and it is stationary.
Proof. Let $\pi$ be any history-dependent policy, generating actions $u(t)$. Since $J^* + w(x) \le c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w(j)$ for every $u$, at each time $t$ we have
\[
c\big(x(t), u(t)\big) \ge J^* + w\big(x(t)\big) - \sum_{j \in \mathcal{X}} p_{x(t)j}\big(u(t)\big) w(j).
\]
Averaging over $t = 0, \ldots, N-1$ and taking expectations (note that $\sum_{j} p_{x(t)j}(u(t)) w(j) = \mathbb{E}[w(x(t+1)) \mid x(t), u(t)]$), we get
\[
\frac{1}{N} \mathbb{E}\left[ \sum_{t=0}^{N-1} c\big(x(t), u(t)\big) \right] \ge \frac{1}{N} \sum_{t=0}^{N-1} J^* + \frac{1}{N} \sum_{t=0}^{N-1} \mathbb{E}\big[ w\big(x(t)\big) - w\big(x(t+1)\big) \big].
\]
The last sum telescopes, so
\[
\frac{1}{N} V^{\pi}_N(x) \ge J^* + \frac{w\big(x(0)\big) - \mathbb{E}\big[ w\big(x(N)\big) \big]}{N}.
\]
When $N$ goes to infinity, the last term vanishes ($w$ is bounded on the finite state space) and we have
\[
\liminf_{N \to +\infty} \frac{1}{N} V^{\pi}_N(x) \ge J^*.
\]
Under $\pi^*$ the inequality holds with equality at every step, so $\pi^*$ attains the average cost $J^*$ and is therefore optimal.
Policy Iteration Algorithm. Let $\pi$ be any stationary policy. We solve
\[
J^{\pi} + w^{\pi}(x) = c\big(x, \pi(x)\big) + \sum_{j \in \mathcal{X}} p_{xj}\big(\pi(x)\big) w^{\pi}(j)
\]
to get $\{J^{\pi}, w^{\pi}(1), \ldots, w^{\pi}(n)\}$ (again fixing one unknown, say $w^{\pi}(1) = 0$). Then let $u = \pi'(x)$ solve
\[
\min_{u} \Big[ c(x, u) + \sum_{j \in \mathcal{X}} p_{xj}(u) w^{\pi}(j) \Big].
\]
Since $\pi'(x)$ attains the minimum, $J^{\pi} + w^{\pi}(x) \ge c\big(x, \pi'(x)\big) + \sum_{j \in \mathcal{X}} p_{xj}\big(\pi'(x)\big) w^{\pi}(j)$. Let $\sigma$ be the invariant probability distribution of the chain under $\pi'$. Multiplying by $\sigma(x)$ and summing over $x$,
\[
\sum_{x \in \mathcal{X}} \sigma(x) J^{\pi} + \sum_{x \in \mathcal{X}} \sigma(x) w^{\pi}(x) \ge \sum_{x \in \mathcal{X}} \sigma(x) c\big(x, \pi'(x)\big) + \sum_{x \in \mathcal{X}} \sigma(x) \sum_{j \in \mathcal{X}} p_{xj}\big(\pi'(x)\big) w^{\pi}(j).
\]
Because $\sigma$ is invariant under $\pi'$, the last term equals $\sum_{j \in \mathcal{X}} \sigma(j) w^{\pi}(j)$, which cancels the second term on the left, leaving
\[
J^{\pi} \ge \sum_{x \in \mathcal{X}} \sigma(x) c\big(x, \pi'(x)\big) = J^{\pi'}.
\]
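Finally, a Python sketch of this average-cost policy iteration. The evaluation step solves the linear system $J^{\pi} + w^{\pi}(x) = c(x, \pi(x)) + \sum_j p_{xj}(\pi(x)) w^{\pi}(j)$ with $w^{\pi}$ fixed to zero at the first state; it assumes the chain under every policy has a single recurrent class, so that $J^{\pi}$ does not depend on the initial state. The array layout matches the earlier sketches and is my own convention.
\begin{verbatim}
import numpy as np

def evaluate_average_cost(P_pi, c_pi):
    """Solve J + w(x) = c_pi(x) + sum_j P_pi[x, j] w(j) with w(0) = 0.
    Unknowns: (J, w(1), ..., w(n-1)); assumes a single recurrent class."""
    n = len(c_pi)
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                # coefficient of J
    A[:, 1:] = np.eye(n)[:, 1:] - P_pi[:, 1:]    # coefficients of w(1..n-1)
    z = np.linalg.solve(A, c_pi)
    J, w = z[0], np.concatenate(([0.0], z[1:]))
    return J, w

def average_cost_policy_iteration(P, c):
    """P[u][x][j] = p_xj(u), c[x][u] = stage cost."""
    n_states, n_actions = c.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[pi, np.arange(n_states), :]     # transitions under pi
        c_pi = c[np.arange(n_states), pi]        # costs under pi
        J, w = evaluate_average_cost(P_pi, c_pi)
        # Improvement: minimize c(x, u) + sum_j p_xj(u) w(j)
        Q = c + np.stack([P[u] @ w for u in range(n_actions)], axis=1)
        pi_new = Q.argmin(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, J, w
        pi = pi_new
\end{verbatim}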