\[
\min_{u} \left[ d(x(T)) + \int_{t=0}^{T-1} c(x(t), u(t), t)\, dt \right].
\]
Define $V(x, T)$ similarly to the above. Then we have
\[
V(x, t) = \min_{u} \Big[ c(x, u, t-1)\,h + V\big(x + f(x, u, t-1)\,h,\; t + h\big) \Big].
\]
Expanding the second term to first order in $h$,
\[
V(x, t) = \min_{u} \Big[ c(x, u, t-1)\,h + V(x, t) + \frac{\partial V(x, t)}{\partial x} f(x, u, t-1)\,h + \frac{\partial V(x, t)}{\partial t}\,h \Big].
\]
Cancelling $V(x, t)$ on both sides and dividing by $h$ gives
\[
\frac{\partial V(x, t)}{\partial t} = -\min_{u} \Big[ c(x, u, t-1) + \frac{\partial V(x, t)}{\partial x} f(x, u, t-1) \Big].
\]
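To make the recursion concrete, here is a minimal numerical sketch (not from the notes): it iterates $V(x, t) = \min_u \big[ c(x, u)h + V(x + f(x, u)h,\, t + h) \big]$ backward on a state grid for an assumed scalar example with $f(x, u) = u$, $c(x, u) = x^2 + u^2$, and $d(x) = x^2$; the grid, horizon, and costs are illustrative choices, not part of the notes.

```python
import numpy as np

# Illustrative problem data (assumptions, not from the notes):
# dynamics f(x, u) = u, stage cost c(x, u) = x^2 + u^2, terminal cost d(x) = x^2.
T, h = 1.0, 0.01                       # horizon and discretization step
xs = np.linspace(-2.0, 2.0, 201)       # state grid
us = np.linspace(-1.0, 1.0, 21)        # discretized control set

V = xs**2                              # terminal condition V(x, T) = d(x)
for _ in range(int(round(T / h))):     # step backward from t = T to t = 0
    Q = np.empty((xs.size, us.size))   # Q[i, j]: cost of applying u_j at state x_i
    for j, u in enumerate(us):
        x_next = xs + u * h            # Euler step of the dynamics
        # stage cost over [t, t+h] plus interpolated cost-to-go at t + h
        Q[:, j] = (xs**2 + u**2) * h + np.interp(x_next, xs, V)
    V = Q.min(axis=1)                  # Bellman minimization over u

print("approximate V(x=0, t=0):", V[np.argmin(np.abs(xs))])
```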
Stochastic System. In this scenario $V(x, t)$ is the expected cost-to-go. We have a controlled Markov chain in which the transition probabilities are given by $p_{ij}(u)$, $u \in U$; that is, the control can alter the transition probabilities. Then we have
\[
V(x, t) = \min_{u} \Big[ c(x, u, t) + \sum_{j \in X} p_{xj}(u)\, V(j, t+1) \Big].
\]
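As a concrete illustration (not from the notes), the sketch below runs this backward recursion for a small controlled Markov chain; the transition probabilities $p_{ij}(u)$, stage costs, terminal cost, and problem sizes are all randomly generated assumptions made for the example.

```python
import numpy as np

n_states, n_actions, T = 3, 2, 5                       # assumed small problem
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))        # P[u, i, j] = p_ij(u)
P /= P.sum(axis=2, keepdims=True)                      # each row sums to one
c = rng.random((n_states, n_actions, T))               # stage costs c(x, u, t)
d = rng.random(n_states)                               # terminal cost d(x)

V = np.zeros((T + 1, n_states))
policy = np.zeros((T, n_states), dtype=int)            # greedy (state-dependent) policy
V[T] = d                                               # boundary condition V(x, T) = d(x)
for t in range(T - 1, -1, -1):
    # Q[x, u] = c(x, u, t) + sum_j p_xj(u) V(j, t+1)
    Q = c[:, :, t] + np.einsum('uij,j->iu', P, V[t + 1])
    V[t] = Q.min(axis=1)                               # cost-to-go
    policy[t] = Q.argmin(axis=1)                       # minimizing u for each (x, t)

print(V[0], policy[0])
```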
The expected cost is given by
\[
\mathbb{E}\left[ \sum_{t=0}^{T-1} c(x(t), u(t), t) + d(x(T)) \;\middle|\; x(0) = x_0 \right].
\]
But how do you choose the actions $(u(0), u(1), \ldots, u(t-1))$? We need a policy.
Policy
• History-dependent policy: choose the action based on the past $H_t = (x(0), u(0), \ldots, x(t-1), u(t-1))$, with $u(t) = g_t(H_t)$, where $g = (g_0, \ldots, g_{t-1})$.
• State-dependent/Markov chain policy: $u(t) = g_t(x(t))$.
• Randomized policy: choose a probability distribution over actions instead of a single action (see the sketch after this list).
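The following small sketch (assumed interfaces, not from the notes) shows the three policy classes as functions of their respective inputs; the two-action setting and the concrete decision rules are arbitrary placeholders.

```python
import random

def history_policy(H_t):
    """History-dependent: u(t) = g_t(H_t) with H_t = (x(0), u(0), ..., x(t-1), u(t-1))."""
    past_states = H_t[::2]             # the x(0), x(1), ... entries of the flat tuple
    return 0 if sum(past_states) >= 0 else 1

def markov_policy(x_t):
    """State-dependent / Markov: u(t) = g_t(x(t)) depends only on the current state."""
    return 0 if x_t >= 0 else 1

def randomized_policy(x_t):
    """Randomized: specify a distribution over actions and sample an action from it."""
    probs = [0.7, 0.3] if x_t >= 0 else [0.3, 0.7]
    return random.choices([0, 1], weights=probs)[0]
```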
Theorem 1.1. Define $\gamma^\star = (\gamma^\star_1, \ldots, \gamma^\star_{t-1})$, where $\gamma^\star_t : X \to U$, as the state-dependent policy such that $u = \gamma^\star_t(x)$ minimizes $\big[ c(x, u, t) + \sum_{j \in X} p_{xj}(u)\, V(j, t+1) \big]$ for each $(x, t)$. Then the expected cost
\[
V(x, t) = \mathbb{E}_{\gamma^\star}\left[ \sum_{n=t}^{T-1} c(x(n), u(n), n) + d(x(T)) \;\middle|\; x(t) = x_t \right]
\]
is optimal in the class of all history-dependent policies $\gamma$, for all $x_t$.
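As a hedged numerical check (illustrative only, reusing P, c, d, T, n_states, V, policy, and rng from the earlier controlled-Markov-chain sketch), one can simulate the chain under the greedy state-dependent policy and compare the Monte Carlo estimate of the expected cost with the dynamic-programming value $V(x_0, 0)$; the two should agree up to sampling error.

```python
# Monte Carlo estimate of E[ sum_{t=0}^{T-1} c(x(t), u(t), t) + d(x(T)) | x(0) = x0 ]
# under the greedy policy computed by backward induction above.
x0, n_runs = 0, 20000
total = 0.0
for _ in range(n_runs):
    x, cost = x0, 0.0
    for t in range(T):
        u = policy[t][x]                       # u(t) = gamma*_t(x(t))
        cost += c[x, u, t]                     # stage cost
        x = rng.choice(n_states, p=P[u, x])    # next state drawn from p_xj(u)
    cost += d[x]                               # terminal cost d(x(T))
    total += cost
print("Monte Carlo:", total / n_runs, "   DP value V(x0, 0):", V[0, x0])
```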
Proof. Let $\gamma = (\gamma_1, \ldots, \gamma_{t-1})$ denote a history-dependent policy. Then the expected cost is
\[
V^{\gamma}_t = \mathbb{E}_{\gamma}\left[ \sum_{n=t}^{T-1} c(x(n), u(n), n) + d(x(T)) \;\middle|\; H_t \right].
\]