25 June 2022

What if I told you that some of the most used algorithms to find the
shortest path in a graph, calculate gradients while training a neural
network, and parse context-free grammars are essentially implementations
of the same principle? It is called *dynamic programming* and is
one of those instances in mathematics where a simple principle unfolds
into profound conclusions ranging over many fields. In fact, we can,
already in this first paragraph, summarize the idea using Richard
Bellman’s (Dynamic Programming’s creator) own words:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I gotta say that I was taught dynamic programming in different
contexts, but it took me a while to finally get the “click” that they
were actually the same thing. When learning algorithms and data
structures, it was a memoization-based technique where you speed up your
algorithm by first solving the easier parts and saving them for later
use. At work, I mostly deal with solving a lot of linear programs for
long-term scheduling problems.^{1} The main algorithm we
use, called *Stochastic Dual Dynamic Programming*, at first
didn’t seem so much like the programming technique algorithms class.
Finally, one of the main methods for model-based reinforcement learning
is again called dynamic programming, and it also didn’t seem so much
like the other instances.

So, what’s happening here? Did everybody choose to call their
algorithms dynamic programming just because it’s a cool name?^{2} Well, in fact there are some
principles that apply to all of those instances, from planning a
rocket’s trajectory to TeX’s word-wrapping. And the list
goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let’s start with something I am really fond of: old school platformer games. In our hypothetical game which is definitely not about some Italian plumber, the character does stands idle doing nothing by default. But with the press of a button in the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activate the respective animation on the screen. In the best Resident Evil style, this game only allows a character to shoot while idle and also forces you to first be idle after a jump before doing any other action. Think of that as the time it takes to restore one’s balance after falling. This description may seem overly complicated on text, but fortunately the nice folks in the Comp Sci department already invented diagrams that show these transitions nicely.

Our modeling above is an instance of something called a *state
machines* or *automata* if you’re into Greek words. There are
4 states in which the vampire can be and at each one there is an
available set of actions to take that may transition him to another
state. More abstractly, an automaton is a system that can be in one of
many *states* s \in \mathcal{S}
and at each state, you can choose among a set of *actions* a \in \mathcal{A}(s) that transition the
system to a new state s' = T(s, a),
where T is called the system’s
*transition function*.

An important notice: If you come from Computer Science, you are
probably most used to *finite* state machines. Just know that in
our context, the states may be any set. Some of the algorithms that we
will see today only work for finite state spaces, but there are others
that may even require a continuous space! An example is SDDP, which uses
linear programming duality and thus requires the state space to be a
convex subset of \mathbb{R}^n.

Iterating the transition T establishes a dynamics for our system where we start at an initial state s and by taking a sequence of actions a_1, a_2, \ldots we walk over the state space.

\begin{aligned} s_1 &= s, \\ s_{t+1} &= T(s_t, a_t). \end{aligned}

This new view makes our state machine somewhat equivalent to a controllable dynamical system, which is another really cool name in my opinion.

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your
system in order to choose an action, no matter the previous history nor
time step. Indeed, if those things affect your choice, then you can
without loss of generality model the problem as a larger system where
the state also carries this information. Thus controlling a dynamic
system amounts to a function \pi : (s :
\mathcal{S}) \to \mathcal{A}(s) which chooses a valid action for
each state. In the literature this called a *policy*, in analogy
to a government taking actions to run its state.

Unfortunately life is not known for its free lunches and in most systems, whenever we take action a at state s, there is a certain cost c(s, a) to pay. Depending on the context this can be, for example, a real monetary cost (for economic problems), some total distance (for planning) or even a negative cost representing a reward.

Thus, by following a policy \pi, we
produce a sequence of cost c(s_t,
\pi(s_t)) for each time step. We could define the total cost for
\pi as the sum of those costs, but
there is an additional detail to notice. If I gave you something and
asked whether you want to pay me today or next year, which option would
you prefer? Sometimes there are factors such as inflation or interests
that make costs in the future not have the same actual value as the
costs we expend right now. This prompts us to introduce a problem
dependent *discount factor* \gamma \in
[0, 1] such that the total cost for a policy \pi is

\begin{array}{rl} v^\pi(s) = & c(s_1, \pi(s_1)) + \gamma c(s_2, \pi(s_2)) + \gamma^2 c(s_3, \pi(s_3)) + \ldots \\ \textrm{where} & s_1 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array}

The equation above defines the *value function* v^\pi : \mathcal{S}\to \mathbb{R} for a given
policy \pi. **Spoiler**:
keep an eye on the v^\pi, because later
in this post we will find them to be useful tools closely related to the
memoization techniques that people usually identify with dynamic
programming.

Besides its practical interpretation, the discount factor \gamma also plays a significant role from the mathematical point of view. If |\gamma| < 1 and the costs are uniformly bounded (which is the case for a finite action space, for example) we can guarantee that the series defining v^\pi converges for any policy and initial state. That is, suppose that exists M > 0 such that

\forall s \in \mathcal{S}, a \in \mathcal{A}(s),\, |c(s, a)| \le M.

This bounds the total cost by a geometric series that cannot blow up,

\sum\limits_{t=1}^\infty \gamma^{t-1}|c(s_t, \pi(s_t))| \le \sum\limits_{t=1}^\infty \gamma^{t-1} M \le \frac{M}{1 - \gamma},

thus guaranteeing that the value function is well-defined.

Having multiple courses of action possible prompts us to ask which is
the best one possible. When programming a robot to escape a labyrinth,
you want it to take the least amount of time. When controlling a
spaceship towards the moon, it is important to guarantee that it will
use the least amount of fuel. When brawling at a bar, you want to knock
out your foe with the least injuries possible. Must of all, the best
policy is the one with the least cost taking *all time* into
account; both the present and its future consequences. For example,
sometimes a policy that has a higher cost for the first state is overall
better because it puts us into a more favorable state. Thus, our problem
can be naturally formulated as searching for *optimal
policies*:

Starting at state s, find a policy \pi producing the least total cost over time.

Or equivalently in math language:

\begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure that we can exploit in order to solve it. This will be the subject of the next section. Before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision making framework.

It is a nice formalism but to be realistic, not all decisions go on forever. I would even risk saying that most of them end after some time. Regardless, the formalism we have developed is powerful enough to also encompass finite-horizon decisions, provided that we make some adjustments to our models.

Consider a process that may end at a certain point but we don’t
really know when. An example would be trying to escape a labyrinth full
of monsters. If a monster catches you its game over and there are no
more decisions to make. We can model this by introducing in our model a
dummy *terminal state* (that we denote as \blacksquare) where there is a single
possible action: to do nothing with zero cost.

Another possibility are problems with a *fixed horizon*, that
is, problems taking a fixed amount N of
steps to end. Since we must know in which stage we are in order to end
the process, we must consider the time step t as part of the state. Then, any state from
the stage N will point towards the terminal state. Thus, given a
transition T for the rest of the
states, our dynamics follows a transition defined as

\bar{T}((t, s), a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases}

When drawing diagrams for fixed horizon problems, it is common practice to cluster the states by their stage.

In finite-horizon problems, we may also set the discount factor \gamma to 1. Since all sums eventually end, there are no convergence issues.

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process description is quite straightforward.

**States**: nodes in the graph.**Actions**at state s: edges going from s to another node.**Transition**: The opposite node on the same edge. That is, given an edge s \to s', T(s, s \to s') = s'.**Costs**: c(s, a) is the weight of edge a.

Finding the shortest path from s to node z is the same as setting the initial state to s and making z a terminal state of our dynamics.

Alright, it’s finally time to solve those decision problems. The simplest idea could be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive since the possible candidates grow exponentially with the time steps. Any practical method will take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous *Bellman principle of
optimality*, which is the cornerstone of dynamic programming. Taking
Richard E Bellman^{3}’s own words, it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is telling us is that in order to calculate an optimal policy, we should turn this iterative process of making actions and calculating costs into a recursive procedure. That is, taking an action puts us into a new state s_2 = T(s_1, a_1) where we are again faced with the exact same problem of finding an optimal policy but this time starting at s_2. Let’s see how we can exploit this idea.

Remember that we defined the value function v^\pi as the total cost of following a policy
\pi when starting at a given state.
Let’s define the *optimal value function* v^\star as the total cost of choosing the
best course of action while starting at a certain state s.

\begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Notice in the optimization problem above that the initial state is
only ever used to choose the first action. Later actions do not depend
directly on it but only on its consequences. This means that we can
break the problem into two parts: calculating an *immediate cost*
dependent on the initial state and calculating a future cost dependent
on all next states.

\begin{array}{rl} v^\star(s) = \min\limits_{a,a_t} & c(s, a) + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_2 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

There’s already some recursive structure unfolding in here! What is still missing consists of noticing that since the sum in the future cost starts at t =2, we can factor out \gamma. By renaming l = t-1 we get

\sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{t=2}^\infty \gamma^{t-2}c(s_t, a_t) = \gamma \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l),

and applying this in the expression for v^\star,

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l) \\ \textrm{s.t.} & s_1 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Although this is a huge expression, it should be straightforward to
see that the expression for the future cost is *exactly* the
optimal value v^\star(s') of
starting the dynamics at s' = T(s,
a). This way, the principle of optimality express itself
mathematically as a recursive equation that the value for an optimal
policy must satisfy.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

This is called the *Bellman equation* and all dynamic
programming consists of methods for solving it. Even more: we can think
of the Bellman equation as a recursive specification for the decision
problems and of dynamic programming as any problem-specific
implementation that solves it.

It is time to get deeper into analysis. Whenever mathematicians see a recursive relation such as the Bellman equation, they immediately start asking such things: what guarantees do we have about v^\star? Can I trust that it is unique? Does it even exist? Surely we mathematicians may seem a bit too anxious with all these questions, but they are for good reasons. Besides the guarantee that everything works, proving the existence of a solution in many cases also teaches us how to construct this solution. In fact, this will happen in this case! In the next section we are going to adapt these proofs into algorithms that converge towards the solution to the Bellman equation.

Recursion has a deep relationship with fixed points. For example, we
can adapt the Bellman equation as the fixed point of an operator \mathcal{B} called, you may guess, the
*Bellman Operator*. It takes value functions to value functions
and is defined as

\begin{array}{rl} (\mathcal{B}v)(v) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

If we interpret v as an estimate for the total future cost at each state, we can interpret \mathcal{B}v as the value function following the policy that chooses the best action according to this estimate.

Saying that an optimal value function must follow the Bellman equation is the same as saying that it is a fixed point of the Bellman operator:

\mathcal{B}v^\star = v^\star.

We thus reduce the question of existence and uniqueness of solutions
for the Bellman equation to finding fixed points of \mathcal{B}. Fortunately there is a theorem
that does just that! It is called the *Banach Fixed Point*
theorem.

Let (M,\mathrm{d}) be a complete metric space and f : M \to M a function such there is a \gamma \in (0,1) satisfying

\mathrm{d}(f(x), f(y)) \le \gamma \mathrm{d}(x, y)

for any x, y \in M. Then f has a unique fixed point x^\star and for any initial value x_0, iterating f will eventually converge towards x^\star,

\lim_{n \to \infty} f^n(x_0) = x^\star.

Proving this theorem is out of scope for this post^{4}.
However, we can think of it as saying that if a mapping contracts
distances between points, it will have to take everything towards a
single point.

To use it for the Bellman operator, we need to find a suitable metric space and prove that \mathcal{B} is a contraction. The space we will consider are the bounded continuous functions over the states, C^0_b(\mathcal{S}, \mathbb{R}), imbued with the uniform norm

\|f\|_\infty = \sup_{s \in \mathcal{S}} |f(s)|.

This is a complete metric space and is general enough for any
practical problem we may think of. Furthermore, recall that if the state
space is discrete, then *any function is continuous and bounded*,
meaning that this encompasses the most important spaces from a
computational perspective.

To prove that \mathcal{B} is a
contraction, we will start with another of its properties:
*monotonicity*. Besides this proof, it will also be useful in the
future when we talk about policy improvement.

Suppose that we have two value function v and w such that v estimates costs uniformly lower than w:

v(s) \le w(s),\; \forall s \in \mathcal{S}.

Then the Bellman operator preserves this relationship: (\mathcal{B}v)(s) \le (\mathcal{B}w)(s), \; \forall s \in \mathcal{S}.

Given a state s, we get from v \le w that the following inequality holds for any action a \in \mathcal{A}(s):

c(s, a) + \gamma v(T(s,a)) \le c(s, a) + \gamma w(T(s,a)).

Since this is valid for any a, taking the minimum on both sides preserves the inequality. Thus

\min_{a \in \mathcal{A}(s)} c(s, a) + v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + w(T(s,a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s).

Finally, let’s prove that the Bellman operator contracts the space of bounded continuous functions by the discount factor. What we need to show is that for any value function v, w:

\|\mathcal{B}v - \mathcal{B}w\|_\infty \le \gamma \|v - w\|_\infty.

From the definition of the uniform norm, we get that for any state s,

v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty.

From the monotonicity we just proved, applying \mathcal{B} to both sides preserves this inequality:

(\mathcal{B}v)(s) \le \mathcal{B}(w + \|v - w\|_\infty)(s).

Let’s show that the constant factor \|v - w\|_\infty only shifts the new function \mathcal{B}w by uniformly by another constant. Calling it k to declutter the notation,

\begin{array}{rlll} (\mathcal{B}(w) + \|v - w\|_\infty)(s) &= &\min\limits_{a} & c(s, a) + \gamma (w(s') + \|v - w\|_\infty) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &= &\min\limits_{a} & c(s, a) + \gamma (w(s')) + \gamma \|v - w\|_\infty \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array}

This proves that

\begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned}

By doing the same derivation in the opposite direction (for w - v) we get an inequality for the absolute value. Applying the supremum, it becomes the result we want.

\begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned}

Since \mathcal{B} is a contraction, the Banach fixed point theorem guarantees to us that there exists a unique value function v^\star satisfying the Bellman equation. Furthermore, the theorem also points us towards a way to solve this kind of procedure with polynomial complexity on the size of the state and action spaces. This is the topic we’re going to investigate next.

For this section, let’s assume that both the state \mathcal{S} and action \mathcal{A}(s) spaces are finite. This will allow us to focus on exhaustive methods exploring the entire state space. Keep calm however, I am planning to write other posts in the future to explain how these ideas generalize to continuous spaces, or even to finite spaces that are too huge to explore entirely, via something called Approximate Dynamic Programming or Reinforcement Learning.

From the discussion about the existence of an optimal value function,
we learned that iterating the Bellman operator convergences towards the
optimal. Thus, we arrive at our first algorithm: *value
iteration*. Its main idea is to convert the Bellman equation into an
update rule:

v \gets \mathcal{B}v.

We can thus start with an initial value function v_0 and iterate the update rule above. By the magic of the Banach Fixed Point theorem, this will converge towards the optimal value function no matter what the initial value function is. This procedure repeats until the uniform error \| v - \mathcal{B}v \|_\infty becomes less than a previously set tolerance. We can obtain an optimal policy from this value function by solving the decision problem associated with it or by keeping track of the solution each time we solve an optimization problem during value iteration. Below we see a Julia implementation of value iteration.

```
function value_iteration(v0 ; tol)
= copy(v0)
v π = Dict{States, Actions}()
= Inf
maxerr while maxerr > tol
= 0
maxerr for s in States
= v[s]
prev π[s] = findmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
v[s], = max(maxerr, abs(v[s] - prev))
maxerr end
end
return π, v
end
```

In the animation below, we can see value iteration in action for a maze solving problem. In this problem, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). Our objective is to reach the right-bottom edge in the minimum amount of steps possible. We do so by starting with a uniformly zero value and a random policy. In the left we see the value function at each iteration and in the right the associated policy.

The algorithm above is in fact just one variation of value iteration. There are still many problem-dependent improvements. For example, we choose to update v in-place, already propagating the new value function while traversing the states, but we could instead have kept the old value function and only updated v after traversing all states. Our approach has the advantage of using the improved information as soon as it is available but updating in batch may be interesting when we’re able to broadcast the optimization across many processes in parallel.

Other important choice we have is the initial value function. Choosing a good warm start can greatly improve the convergence. As an example, whenever there is a terminal state \blacksquare, it is a good idea to already fill v(\blacksquare) = 0. Finally, the order that we traverse the states matter. There is a reason why dynamic programming is famous for solving problems backwards. If we know that a given state is easier to solve, we should start the traverse by it.

For example, our maze solver above could see a lot of improvement by traversing the cells breadth-first from the terminal node. Similarly, in a fixed horizon, the best approach is to start in the states for the last stage and keep going back in time, something we will discuss in the next section. When we have such structure in the state space, traversing this way may even converge to the optimal policy in a single iteration!

Well, we have a lot of options… Nevertheless, as long as we keep visiting all states, any of those approaches is guaranteed to converge towards the optimal value function. Which one is faster being generally problem-dependent. This is why I think it is best to think of DP not as an algorithm but as a principle that encompasses many similar algorithms.

Recall the finite horizon problem that we discussed before. In there, the state had a special structure where it also carried the stage information and eventually transitioned to a terminal state when t became greater than a certain fixed N. As we noticed earlier, the best way to traverse the states is backwards in time. Let’s investigate a bit what this means.

The state space \mathcal{S} is
clustered into smaller spaces for each t: \mathcal{S}_1,
\mathcal{S}_2, \ldots, \mathcal{S}_N, \mathcal{S}_{N+1} =
\{\blacksquare\} where we denote the terminal state by \blacksquare. Also, we can only transition
from \mathcal{S}_t to \mathcal{S}_{t+1}. Let’s thus denote by v_t the value function restricted to \mathcal{S}_t. In the terminal it equals zero
by definition, implying that in the last stage, any action has *no
future cost* since we can only transition to the terminal. Finding
the value function v_N restricted to
the last stage is therefore as simple as finding the best action for
each state in S_N.

This way, after one iteration v_N
already converged. Now that we know it, finding v_{N-1} is reduced to the exact same problem
and we can thus iterate backwards until we solve for the first stage. At
this point, we know by construction that we already converged to the
optimal value function, with no need for further iterations. This
procedure, which is a smart variation of value iteration, is also called
*backward induction* for its nature of propagating information
from the last time step towards the first. In Julia, we write it as
value iteration with a different loop and without the need of an initial
value function.

```
function backward_induction()
= Dict(s => 0 for s in States)
v π = Dict{States, Actions}()
for t in 1:N
for s in States(t)
π[s] = findmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
v[s], end
end
return π, v
end
```

One issue with value iteration is that all policy calculations are implicit, since we just work with value functions. It is then possible that we reach an optimal policy and keep iterating the algorithm to improve the estimate for the value function. In this section, let’s see how we can directly calculate an optimal policy in a finite number of steps.

Our next question is then how to calculate the cost associated with a policy. Let’s say somebody gave you a policy \pi and told you nothing more about it. How can you find its value function v^\pi? One way is to notice that it satisfies a recursion similar to the Bellman equation, but without the minimization step.

v^\pi(s) = c(s, \pi(s)) + \gamma v^\pi(T(s, \pi(s)).

If you follow the same steps we did before transforming this equation
into a fixed point problem, you will see that under the same assumptions
of continuity (always valid for finite state and action spaces) it also
has a unique solution. Moreover, turning it into an update procedure
converges towards v^\pi for any initial
value function. Thus, we arrive at an algorithm for evaluating the cost
of a policy, unimaginatively called *policy evaluation*.

```
function policy_evaluation(π, v0=zeros(States); tol)
= copy(v0)
v = Inf
maxerr while maxerr > tol
= 0
maxerr for s in States
= v[s]
prev = c(s, π[s]) + γ*v[T(s, π[s])]
v[s] = max(maxerr, abs(v[s] - prev))
maxerr end
end
return v
end
```

Notice the similarity with value iteration. The only difference is that we are iterating a single evaluation, not an entire optimization problem.

After we know a policy and its value function, our next question is how to improve it. That is, how can we use this information to get nearer to an optimal policy.

The secret lies in locally improving our policy for each state.
Consider a state s. The value v^\pi(s) is the total cost of starting at
s and following \pi thereafter. What if there is an a \in \mathcal{A}(s) such that choosing a at the first step and following \pi thereafter is *cheaper* than just
following \pi?

c(s, a) + v^\pi(T(s, a)) < v^\pi(s).

Since we have this improvement at the first step and our processes are assumed to not depend on anything besides what is represented on the state s, then it must be better to choose a than \pi(s) whenever we are at state s. That is, if we define a new policy

\lambda(x) = \begin{cases} a, & x = s \\ \pi(x), & \text{otherwise} \end{cases}

it is always better to follow \lambda then \pi, because v^\lambda \le v^\pi.

We thus arrive at our next algorithm: *policy iteration*. It
consists of taking a policy \pi,
finding its value function v^\pi
through policy evaluation and finally using policy improvement to arrive
at a better policy. Since there are only finitely many policies, and we
always obtain a strictly better policy, this algorithm is guaranteed to
converge to an optimal policy in a finite amount of steps.

```
function policy_improvement(π0, v)
π = copy(π0)
for s in States
π[s] = argmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
end
return π
end
function policy_iteration(π, v=zeros(States); tol)
while true
0 = π
π= policy_evaluation(π, v; tol=tol)
v π = policy_improvement(π, v)
if π == π0 break end
end
return π, v
end
```

Did you notice that we iteratively alternate between two kinds of passes over the states? This is the mother of all backwards-forwards algorithms people generally associate with dynamic programming.

Just like with value iteration, there’s also a lot of freedom in how we traverse the states. Again it is useful to think of policy iteration more as a principle than as an algorithm in itself and adapt the steps to consider any problem specific information that may be available.

Until now, we’ve only dealt with deterministic processes. Life, on the other hand, is full of uncertainty and, as a good applied field, dynamic programming was created from its inception to deal with stochastic settings.

We call a state machine where the transitions T(s, a) and costs c(s, a) are stochastic a *Markov Decision
Process* (MDP for short). This name comes from the fact that the new
state only depends on the current state and action, being independent of
the process’ history just like a Markov chain. A usual intuition for
this kind of processes is as the interaction between an actor and an
environment. At each time step, the environment is at a state s and the actor may choose among different
actions a \in \mathcal{A}(s) to
interact with the environment. This action affects the environment in
some way that is out of reach to the actor (this stochastic /
non-deterministic), changing its state to s' = T(s, a) and incurring a cost of
c(s, a) to the actor, as illustrated in
the diagram below.

Allowing non-determinism opens the way for modeling a lot more cool situations. For example, robots that play video games! The states may be the internal state of the game or some partial observation of them that is available to the player and the actions are the buttons on the controller. The transitions are internal to the game and the costs are related to some winning/losing factor. Have you ever heard of Deep Mind’s Playing Atari with Deep Reinforcement Learning paper? In it, they use reinforcement learning to train a robot capable of playing Atari 2600 games and all modeling is done via Markov decision processes in a way that is really similar to the discussion in this paragraph. I really recommend checking it out.

With a stochastic environment, we can’t necessarily predict the costs and transitions, only estimate them. To accommodate that, we will have to change a few things in our definition of policies and value functions. Those will specialize to our old definitions whenever the MDP is stochastic.

A deterministic policy was a choice of action for each state. We
define a *stochastic policy* as a random function choosing an
action given a state. For the value function, we still want to assign a
total cost to each state. A solution is to consider all possible costs
and take their average. The value of a stochastic policy is thus

v^\pi(s) = \mathbb{E}^\pi\left[c(s, a) + \gamma v^\pi(T(s, a))\right | a = \pi(s)]

where we consider the action a to be randomly sampled with according to the policy and write \mathbb{E}^\pi for the average value considering that we follow it. Similarly, the Bellman equation for the optimal value function also considers the mean cost:

\begin{array}{rl} v^\star(s) = \min\limits_{a} & \mathbb{E}\left[c(s, a) + \gamma v^\star(s')\right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

In this post we will only work the expected value but know that if you were feeling particularly risk-averse, you could exchange it for any coherent risk measure and all results would still hold.

If the (possibly random) states, actions and costs are finite, the
methods of *value iteration* and *policy iteration* also
hold with minor adjustments. One can prove the existence and convergence
of the stochastic Bellman equation with the same assumptions as
before.

The main difference from the deterministic case is that we need to
evaluate the expectation during the process. Let’s call p(s', c | s, a) the probability of
getting a transition s' = T(s, a)
and a cost c(s, a) given that we choose
action a at state s. With this, the expectation becomes a
weighted sum and we can do *value iteration* exactly as before,
with the only difference that our update rule becomes

v(s) \gets \min\limits_{a} \sum_{s',\,c} p(s', c | s, a)\left(c + \gamma v(s')\right).

Similarly, the update rules for policy evaluation is

v(s) \gets \sum_{s',\,c} p(s', c | s, \pi(s))\left(c + \gamma v(s')\right)

and the one for policy improvement is

\pi(s) \gets \argmin\limits_{a} \sum_{s',\,c} p(s', c | s, a)\left(c + \gamma v(s')\right).

The algorithm for policy iteration is also exactly the same as the deterministic one except for these two lines of code.

Well, we finally reached the end of our overview of dynamic
programming. I hope it was as fun to read as it was for me to write. And
that DP gets the honor place it deserves in your problem solving.^{5} Of course, a single blog post is too
tinny to encompass a subject as vast as DP. So let’s give a little
appetizer of what is out there.

One thing that we barely scraped the surface was continuous state problems, although they being quite important. And I say this last part seriously as someone who works daily with them. In our theoretical discussions, we didn’t really use the finitude of the states, only when we we built the algorithms with vectors and loops. This means that at a mathematical level, both value and policy iteration work for infinite states if you keep in mind the continuity assumptions.

Therefore, if one has a way to traverse the entire state space \mathcal{S}, the algorithms will converge and
we may even get an analytic solution to the Bellman Equation. In
practice this is hard and we have to resort to some kind of
*Approximate Dynamic Programming*. The idea is to approximate the
value function v with some
representation requiring only a finite amount of information and sample
the state space in order to fit it with the data.

The best way to do that is problem-dependent, of course. For example, if the optimization in the Bellman equation is a linear program,

\begin{array}{rl} v(s) = \min\limits_{a} & c^ta + \gamma v(s') \\ \textrm{s.t.} & s' = Ms + Na, \\ & a \ge 0, \end{array}

one can guarantee that v must be convex and piecewise-linear. Thus, at each iteration of our algorithm we can solve for state s^{(i)} and use linear programming duality to get an affine subapproximation to v (called a cutting plane),

v(s) \ge v^{(i)} + \left\langleλ^{(i)}, s - s^{(i)}\right\rangle, \forall s \in \mathcal{S}.

Hence instead of solving the recursive equation, we sample the state space and use this cut approximation of v:

\begin{array}{rl} v(s) = \min\limits_{a} & c^ta + \gamma z \\ \textrm{s.t.} & s' = Ms + Na, \\ & a \ge 0, \\ & z \ge 0, \\ & z \ge v_i + λ(s - s_i), \forall i. \end{array}

This procedure yields piecewise-linear approximations that eventually
converge to the real optimal value function. The idea we just discussed
forms the basis of *Dual Dynamic Programming*, a method largely
employed by people in Stochastic Programming and Operations
Research.

Another thing we can do when the problem’s structure is not as nice
as a linear program is to use some kind of universal approximator for
v, such as a neural network. This is
called *Neural Dynamic Programming* and the work of Dmitri Bertsekas is
a great place to start learning about this.

Besides the algorithms, there are other classes of problems that
admit a Bellman equation but didn’t fit our narrative despite their
importance. One such example are Game problems where we do not have
total control over the actions taken. Differently from MDPs where we
know the transition probabilities for all states, in this case there are
also other players who are supposedly trying to *maximize* your
cost. As an example, think of a game of Poker where the other players
want to maximize your losses (thus minimize their own). In this case,
besides your actions a, there are also
actions \alpha from the other players
who affect the state transition s' = T(s,
a, \alpha) providing a Bellman equation

\begin{array}{rl} v(s) = \min\limits_{a} \max\limits_{\alpha} &\mathbb{E}[c(s, a) + \gamma v(s')] \\ \textrm{s.t.} & s' = T(s, a, \alpha), \\ & a \in \mathcal{A}_\mathrm{you}(s) \\ & \alpha \in \mathcal{A}_\mathrm{enemy}(s). \end{array}

Another important instance that I didn’t mention before are control problems with continuous time. This is a huge area of engineering and applied math and much of Bellman’s original work was focused on solving this kind of problem. As an example, think of an airplane flying. It’s dynamics are given by a differential equation taking account both its current state s and how the pilot controls the pane a,

\frac{ds}{dt} = T(s, a).

Think of s_t as the plane’s position and direction, while a_t represents how much the pilot chooses to accelerate or turn the plane’s yoke at time t. The cost may be the fuel consumption. As any good physics problem this has continuous time and the sum over states becomes an integral

\begin{array}{rl} v(s) = \min\limits_{a} & \int_{0}^{\infty} c(s(t), a(t))dt \\ \textrm{s.t.} & \frac{ds}{dt} = T(s(t), a(t)), \\ & s(0) = s, \\ & a \in \mathcal{A}(s). \\ \end{array}

By following similar steps to our deduction of the discrete time
Bellman equation plus a lot of hard analysis^{6},
one arrives at a non-linear partial differential equation called the
*Hamilton-Jacobi-Bellman equation*:

v(s) = \min_{a} c(s, a) + \left\langle \nabla v(s), T(s, a) \right\rangle.

This equation analogously describes the optimal value of a state recursively but now it is much harder to solve. The methods employed generally involve discretization allied with, our by now old friends, value or policy iteration.

After this final overview, it’s indeed time to end our journey, nevertheless knowing that there is much more to explore out there. I will try to post more about dynamic programming and specially its connections to Stochastic Optimization and Reinforcement Learning but I don’t really know when I will have the time.

Farewell and see you next time!

This post come to life after a series of conversations I had with Pedro Xavier. The good thing of explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also in debt with Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

To be more precise, we work with hydrothermal dispatch problems, where one must decide between many sources of energy (hydro, thermal, renewable) to supply a certain power demand taking into account the uncertainties of the future. For example: hydro is cheap and clean, but you risk running out of water if you use all of it and next month turns out particularly dry. Finding the best energy dispatch is once again solved via dynamic programming.↩︎

Even Richard Bellman admittedly chose because of that.↩︎

*Dynamic Programming*, Princeton Landmarks in Mathematics and Physics (Princeton, NJ: Princeton University Press, 1957), ch 3, p. 83.↩︎It is out of scope more because it is a tangent to the topic than because of any difficulty in the proof. If you are interested in analysis, I really recommend you to try proving it. The proof consists of using the contraction property to show that the distance between the iterations of f must converge towards zero.↩︎

In fact, fixed points and recursion deserve the spot. They’re everywhere!↩︎

The derivation mostly consists of Taylor expansions to be sincere. The tricky part is justifying your steps.↩︎