# A Tale of Dynamic Programming

25 June 2022

What if I told you that some of the most used
algorithms to find the shortest path in a graph,
calculate gradients while training a neural network, and
parse context-free grammars are essentially
implementations of the same principle? It is called
*dynamic programming* and is one of those
instances in mathematics where a simple principle
unfolds into profound conclusions ranging over many
fields. In fact, we can already summarize the idea in this first paragraph, using the words of Richard Bellman, dynamic programming's creator:

> An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I have to admit that despite encountering dynamic
programming in different contexts, it took me a while to
finally get the “click” that they were actually the same
thing. When learning algorithms and data structures, it
was a memoization-based technique where you could speed
up some algorithms by first solving the easier parts and
storing the solution for later use. Then, at work, I
mostly deal with solving a lot of linear programs for
long-term scheduling problems.^{1} The main algorithm
we use, called *Stochastic Dual Dynamic
Programming*, at first didn’t seem so much like the
programming technique from the algorithms class.
Finally, one of the main methods for model-based
reinforcement learning is again called dynamic
programming, and it also didn’t seem so much like the
other instances.

So, what’s happening here? Did everybody choose to
call their algorithms dynamic programming just because
it’s a cool name?^{2} Well, in
fact there are some principles that apply to all of
those instances, from planning a rocket’s trajectory to
TeX’s word-wrapping. And the list
goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

## On Decision-Making and State Machines

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let's start with something I am really fond of: old school platformer games. In our hypothetical game, which is definitely not about some Italian plumber, the character stands idle doing nothing by default. But with the press of a button on the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activates the respective animation on the screen. In the best Resident Evil style, this game only allows a character to shoot while idle and also forces you to first be idle after a jump before doing any other action. Think of that as the time it takes to restore one's balance after falling. This description may seem overly complicated in text, but fortunately the nice folks in the Comp Sci department already invented diagrams that show these transitions nicely.

Our modeling above is an instance of something called a *state machine*, or *automaton* if you're into Greek words. There are 4 states in which the character might be, and at each one there is an available set of actions to take that transition the character to another state. More abstractly, an automaton is a system that can be in one of many *states* s \in \mathcal{S} and, at each state, you can choose among a set of *actions* a \in \mathcal{A}(s).
Whenever you take an action, the system changes to a new
state according to a *transition function*

T : (s : \mathcal{S}) \times \mathcal{A}(s) \to \mathcal{S}.

Unfortunately life is not known for its free lunches
and, in general, whenever one takes action a at state s, it is necessary to pay a
certain *cost*, properly modeled as another
function

c : (s : \mathcal{S}) \times \mathcal{A}(s) \to \mathbb{R}.

Depending on the context this can be, for example, a real monetary cost (in economic contexts), some total distance or elapsed time (for planning) or even a negative cost representing a reward.
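As a sketch of these definitions, the platformer automaton from the previous section could be tabulated like this (the state and action names here are made up for illustration):

```
# Transition table for the (hypothetical) platformer automaton.
# Shooting is only allowed while idle, and a jump always lands back on idle.
const transition = Dict(
    (:idle, :shoot)    => :shooting,
    (:idle, :jump)     => :jumping,
    (:idle, :walk)     => :walking,
    (:walking, :stop)  => :idle,
    (:walking, :jump)  => :jumping,
    (:shooting, :stop) => :idle,
    (:jumping, :land)  => :idle,
)

# The available actions A(s) at a state are the ones with a defined transition.
actions(s) = [a for (s0, a) in keys(transition) if s0 == s]

# The transition function T just looks up the table.
T(s, a) = transition[(s, a)]
```

A cost function `c(s, a)` would be one more table with the same keys.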

### The Dynamics of Decision-Making

Iterating the transition T establishes a dynamics for our system: by starting at an initial state s_0 and taking a sequence of actions \{a_t\}, we generate a trajectory over the state space.

s_{t+1} = T(s_t, a_t).

When viewed in this light, our state machines are
called *controllable dynamical systems* or
*decision processes*, which are yet additional
cool names for you to memorize.

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your system in order to choose an action, no matter the previous history nor time step. Indeed, if any other thing affects your choice, you can, without loss of generality, model the process as a larger automaton where the state also carries the additional information. Thus, controlling a dynamic system amounts to selecting a valid action for each state, that is, a function

\pi : (s : \mathcal{S}) \to \mathcal{A}(s).

In the literature this is called a *policy*,
in analogy to a government taking actions to control the
state of the nation.

Starting at state s_0 and following a policy \pi produces a deterministic dynamical system without the need for choosing a control:

s_{t+1} = T(s_t, \pi(s_t)).

This dynamics, in turn, yields a cost c(s_t, \pi(s_t)) for each time step. We could define the total cost for \pi as the sum of those costs, but there is an additional detail to notice. Suppose that, for any reason, money is short and you had to take a loan in order to pay your bills. In those struggling conditions, would you prefer to pay the money back today or next year?

Sometimes factors such as inflation or interest make a future cost's real value differ from its nominal value. This prompts us to introduce a problem-dependent *discount factor* \gamma \in [0, 1] representing how much the cost depreciates over time.
The total cost of following a certain policy \pi is the cumulative sum of
all properly discounted costs we generate by following
it. We define the *value function* v^\pi : \mathcal{S}\to
\mathbb{R} associated with \pi as the total cost of
starting at a given state:

\begin{array}{rl} v^\pi(s) = & c(s_0, \pi(s_0)) + \gamma c(s_1, \pi(s_1)) + \gamma^2 c(s_2, \pi(s_2)) + \ldots \\ \textrm{where} & s_0 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array}

Besides its practical interpretation, the discount factor \gamma also plays a significant role from the analytical point of view. If |\gamma| < 1 and the costs are uniformly bounded (which is the case for a finite action space, for example), we can guarantee that the series defining v^\pi converges for any choice of actions and initial state. That is, suppose there exists M > 0 such that

\forall s \in \mathcal{S}, a \in \mathcal{A}(s),\, |c(s, a)| \le M.

This bounds the total cost by a geometric series that cannot blow up,

\sum\limits_{t=0}^\infty \gamma^{t}|c(s_t, a_t)| \le \sum\limits_{t=0}^\infty \gamma^{t} M \le \frac{M}{1 - \gamma},

thus guaranteeing that the value function is well-defined.
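This bound also suggests a pragmatic sketch for estimating v^\pi: truncate the series after n steps, since the neglected tail is at most \gamma^n M / (1 - \gamma). Assuming hypothetical `T`, `c` and policy functions:

```
# Approximate v^π(s) by summing the discounted costs of the first n steps.
# With |c| ≤ M, the neglected tail is bounded by γ^n * M / (1 - γ),
# so the truncation error can be made as small as we like.
function simulate_value(T, c, π, s; γ = 0.9, n = 1000)
    total = 0.0
    for t in 0:n-1
        a = π(s)
        total += γ^t * c(s, a)
        s = T(s, a)
    end
    return total
end

# Toy check: a single looping state with unit cost yields the
# geometric series 1 + γ + γ² + … ≈ 1 / (1 - γ) = 10.
v = simulate_value((s, a) -> s, (s, a) -> 1.0, s -> :stay, 1; γ = 0.9)
```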

### Optimal Decisions

Having multiple possible courses of action prompts us
to ask which one is the best. When programming a robot
to escape a labyrinth, you want it to take the least
amount of time; when controlling a spaceship towards the
moon, it is important to guarantee that it will use the
least amount of fuel; when brawling at a bar, you want
to knock out your foe while sustaining the least
injuries possible. Most of all, the best policy is the
one with the least cost taking *all time* into
account — both the present and its future consequences.
For example, sometimes a policy that has a higher cost
at the first state is overall better because it puts us
into a more favorable state. Thus, our problem can be
naturally formulated as searching for *optimal
policies*:

Starting at state s, find a policy \pi producing the least total cost over time.

Or equivalently in math language:

\begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_0 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure we’re able to exploit. That is the subject of the next section. But, before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision-making framework.

#### Example: Shortest Path in a Graph

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process is quite straightforward.

- **States**: nodes in the graph.
- **Actions** at state s: edges going from s to another node.
- **Transition**: the opposite node on the same edge. That is, given an edge s \to s', T(s, s \to s') = s'.
- **Costs**: c(s, a) is the weight of edge a — the time taken traveling through that edge.

Finding the shortest path from s to z is the same as setting the initial state to s and making z a terminal state of our dynamics.
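As a sketch, the translation could look like this (the graph and its city names are made up):

```
# Weighted directed graph: (from, to) => distance.
const edges = Dict(
    (:home, :lima)  => 8.0,
    (:home, :quito) => 5.0,
    (:quito, :lima) => 2.0,
    (:lima, :cuzco) => 3.0,
)

# Actions at a node are its outgoing edges; taking one moves us to the
# opposite endpoint and costs the edge's weight.
actions(s) = [t for (f, t) in keys(edges) if f == s]
T(s, a)    = a
c(s, a)    = edges[(s, a)]
```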

## Dynamic Programming

Alright, it’s finally time to start optimizing those decision problems. The simplest idea would be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive since the possible candidates grow exponentially with the time steps. Any practical method must take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous *Bellman
principle of optimality*, which is the cornerstone
of dynamic programming. Taking Richard E Bellman^{3}’s own words,
it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is telling us is that in order to calculate an optimal policy, we should turn this iterative process of making actions and calculating costs into a recursive procedure. That is, taking an action a at the initial state s puts us into a new state s' = T(s, a) where we are again faced with the exact same problem of finding an optimal policy, but this time starting at s'. Let’s see how we can exploit this idea.

Remember that we defined the value function v^\pi as the total cost of
following a policy \pi
when starting at a given state. Let’s define the
*optimal value function* v^\star as the total cost of
choosing the best course of action while starting at a
certain state s.

\begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_0 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Notice in the optimization problem above that the
initial state is only ever used to choose the first
action. Later actions do not depend directly on it but
only on its consequences. This means that we can break
the problem into two parts: calculating an *immediate
cost* dependent only on the initial state and
calculating a *future cost* dependent on all the
ensuing states.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & {c(s, a)} + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

There’s already some recursive structure unfolding in here! What is still missing consists of noticing that since the sum in the future cost starts at t = 1, we can factor out \gamma. By renaming l = t-1 we get

\sum\limits_{t=1}^\infty \gamma^{t}c(s_t, a_t) = \gamma \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{l=0}^\infty \gamma^{l}c(s_{l+1}, a_{l+1}),

and applying this in the expression for v^\star,

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=0}^\infty \gamma^{l}c(s_l, a_l) \\ \textrm{s.t.} & s_0 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Although this is a huge expression, it should be
straightforward to see that the expression for the
future cost is *exactly* the optimal value v^\star(s') of starting
the dynamics at s' = T(s,
a). This way, the principle of optimality expresses
itself mathematically as a recursive equation that the
value of an optimal policy must satisfy.

\boxed{ \begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} }

This is called the *Bellman equation* and all
dynamic programming consists of methods for solving it.
Even more: we can think of the Bellman equation as a
recursive specification for the decision problems and of
dynamic programming as any problem-specific
implementation that solves it.

### Existence, Uniqueness and Fixed Points

It is time to get deeper into analysis. Whenever mathematicians see a recursive relation such as the Bellman equation, they immediately start asking questions: what guarantees do we have about v^\star? Can I trust that it is unique? Does it even exist? Surely we mathematicians may seem a bit too anxious with all these questions, but they are asked for good reason. Besides the guarantee that everything works, in this case proving the existence of a solution also teaches us how to construct it! So pay attention, because in the next section we’re going to adapt the theorems in here into algorithms to solve the Bellman equation.

Recursion has a deep relationship with fixed points,
allowing us to use whichever is more practical to our
goals. To solve a problem with dynamic programming, our
first step will be to write the Bellman equation as the
fixed point of an operator \mathcal{B}: (\mathcal{S}\to
\mathbb{R}) \to (\mathcal{S}\to \mathbb{R})
called — guess what — the *Bellman Operator*. It
takes value functions to value functions and is defined
as

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Now, the optimal value function v^\star is precisely a fixed point of \mathcal{B}. We thus reduce the question of existence and uniqueness of solutions for the Bellman equation to finding fixed points of \mathcal{B}:

v^* = \mathcal{B}v^*.

If you are not accustomed to fixed points, the transition above from the Bellman equation to the operator may seem strange. Hence, let’s go through a little story in order to develop some intuition.

Imagine that you are the king/queen of a fantastical kingdom. You are an absolute monarch and whatever action you decide to take, your subjects will follow. Lately, the kingdom’s reserves are running dry and your counselors advised you to define a clear governing policy in order to minimize the kingdom’s spending. Since besides a ruthless ruler you’re also a great mathematician and a fan of this blog, at this point you already know what must be done to save your kingdom from bankruptcy: solve the Bellman equation.

Because at this point of the post you still don’t know how to solve it, you decide to take advantage of your fantastical world and hire a wizard to look into his crystal ball and act as an oracle telling you how much each state of affairs will cost to the kingdom in the future. After God knows how many rituals and incantations, the wizard hands you a shining new value function \char"1f52e: \mathcal{S}\to \mathbb{R}.

Since it is never wise to blindly follow the advice from a crystal ball, you decide to use it only to predict the future, while relying on your own discernment for taking immediate decisions. In other words, you rule your kingdom by solving the optimization problem

\begin{array}{rl} (\mathcal{B}\char"1f52e)(s) = \min\limits_{a} & c(s, a) + \gamma \char"1f52e(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Thus, the process of going from prediction to decision is precisely the Bellman operator. The function \mathcal{B}v is the cost of choosing the best action taking v as your future estimate.

#### A Useful Theorem

Alright, we have transformed the problem of finding an optimal decision into solving the Bellman equation, and then transformed it again into finding a fixed point to the Bellman operator,

\mathcal{B}v^\star = v^\star.

If this seems as complicated — although more abstract
— as where we started, have no fears! We just arrived at
a point where we can invoke a powerful theorem from
Analysis to solve all our issues at once. Behold the
*Banach Fixed Point Theorem*!

In a complete metric space (M,\mathrm{d}), any continuous f : M \to M that contracts the distance between points,

\mathrm{d}(f(v), f(w)) \le \gamma \mathrm{d}(v, w),\; \textrm{for some } \gamma \in [0, 1),

has a unique fixed point v^\star.

Furthermore, one can arrive at v^\star from any initial value v_0 by iterating f:

\lim_{n \to \infty} f^n(v_0) = v^\star,\; \forall v_0 \in M.

This procedure converges linearly with the error at each iteration bounded as

\mathrm{d}(v_n, v^*) \le \frac{\gamma^n}{1 - \gamma} \mathrm{d}(v_0, f(v_0)).

Proving this theorem is out of scope for this post^{4}. However, we can
think of it as saying that if a mapping shrinks all
distances, then eventually the image of all points will
be squeezed into a single point.
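To see the theorem in motion, here is a quick numeric sketch with a made-up example: the map f(x) = x/2 + 1 contracts distances on the real line with \gamma = 1/2, and iterating it from any starting point converges to its unique fixed point x^\star = 2.

```
# Iterate a map n times starting from x0: x ← f(x).
function iterate_map(f, x0; n = 50)
    x = x0
    for _ in 1:n
        x = f(x)    # the update rule from the theorem
    end
    return x
end

# f contracts distances by γ = 1/2; its fixed point solves x = x/2 + 1, i.e. x* = 2.
f(x) = x / 2 + 1
x_star = iterate_map(f, 10.0)
```

After 50 iterations the initial error of 8 has shrunk by a factor of 2^{50}, exactly the linear convergence the theorem promises.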

What we care most about in this theorem is that it gives us a constructive recipe to find fixed points in metric spaces. All we have to do is interpret the fixed point equation as an update rule,

v \gets f(v),

and iterate it until the distance between iterates falls below a certain tolerance. It is then straightforward to convert this description into a computational procedure.

```
function fixed_point(f; v0, tol)
    v = f(v0)
    while distance(v, v0) > tol
        v0 = v
        v = f(v0) # Update rule
    end
    return v
end
```

#### A Metric Space of Value Functions

To apply the Banach fixed point theorem to the Bellman operator, we must find a suitable function space where \mathcal{B} is a contraction. A fitting choice is the space of bounded continuous functions over the states, C^0_b(\mathcal{S}, \mathbb{R}), with distance given by the uniform norm

\mathrm{d}(v, w) = \|v - w\|_\infty = \sup_{s \in \mathcal{S}} |v(s) - w(s)|.
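With value functions stored as arrays, this metric is a one-liner; the following sketch is one possible implementation of the `distance` helper used by `fixed_point` above.

```
# Uniform (sup-norm) distance between two value functions stored as arrays.
distance(v, w) = maximum(abs.(v .- w))
```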

In this (complete) metric space, \mathcal{B} turns out to be a
contraction with factor \gamma (yes, the discount
factor). Furthermore, for finite state spaces, *any
function is continuous and bounded*, meaning that
the above encompasses all algorithms of interest in this
post.

Since I don’t want to depart too much from the post’s main topic nor dive into mathematical minutiae, we are going to relegate the necessary proofs to an appendix.

Any decision process with discount factor \gamma < 1 has a unique optimal value function v^\star satisfying the Bellman equation

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Furthermore, you can calculate the (not necessarily unique) optimal policy via

\begin{array}{rl} \pi^\star(s) = \argmin\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

It all follows from applying the Banach fixed point theorem to the Bellman operator.

The above result is what I like to call a “bazooka theorem”, because besides guaranteeing the existence and uniqueness of an optimal value function — and consequently an optimal policy — it also teaches us how to calculate it for finite states, as we will shortly see in the ensuing section.

## Solving the Bellman Equation

Dynamic Programming consists of solving the Bellman Equation, and, as with all famous equations, there are many possible approaches. Which one to choose will depend on the problem and hardware at hand.

From now on, let’s assume that both the state \mathcal{S} and action \mathcal{A}(s) spaces are
*finite*. This allows us to focus on exhaustive
methods exploring the entire state space. There are
other methods, such as Reinforcement Learning or Dual
Dynamic Programming, which are able to generalize the
ideas in here to infinite spaces. But this is a story
for another night…

Before jumping into the algorithms, it is worth discussing a couple of technical decisions we must make in order to implement them.

The first thing we must notice is that although the
most straightforward way to represent functions in a
programming language is through computational
procedures^{5}, that
would be quite inefficient in our case, because
altering procedures for improvement is
computationally expensive. Thus, for efficiency reasons,
it is customary to represent the policy and value
function not as procedures but with some other data
structure. Since our state space is finite, there is a
wide range of data structures capable of exactly
representing such functions. Common choices are arrays
or hash maps, but you can really use anything capable of
storing coefficients. Below we see some examples with
these types for in-memory storage.

```
# Storage with vector / array
# WARNING: This requires some method idx : States -> Int for later indexing
function asarray(f :: Function)
    return [f(s) for s in States]
end

# Storage with hash map / dictionary
function asdictionary(f :: Function)
    return Dict(s => f(s) for s in States)
end
```

The memoization that people associate with dynamic programming lies entirely in this “trick”. However, it is good to keep in mind that this is only a matter of computational representation of functions and is totally orthogonal to any algorithmic design. In a language with first-class functions, it is possible to do dynamic programming using only function composition. It just so happens that it will not be as fast as one would like.
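To make this concrete in isolation, here is the classic classroom example of the same trick: representing the recursive Fibonacci function as an array of its values, filled in dependency order. The function being represented doesn't change; only its storage does, turning exponential recursion into a linear table fill.

```
# Represent fib : Int -> Int as an array of its values, filled in
# dependency order: each entry only needs the two entries before it.
function fib_table(n)
    fib = zeros(Int, n)
    fib[1] = fib[2] = 1
    for k in 3:n
        fib[k] = fib[k-1] + fib[k-2]
    end
    return fib
end
```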

In the algorithms, we will write this choice of representation as two opaque types `Values{States}` and `Policy{States, Actions}`, assumed to deal with all boilerplate as needed.

Another important thing to mention is that we will
also only interact with a process via its *total
cost*, never touching its constituent parts
separately^{6}. Hence,
let’s already build a function that does the conversion
for us.

```
# Turn a decision problem into its respective cost function.
function total_cost(p :: Process)
    return (v, s, a) -> p.cost(s, a) + p.γ * v[p.next(s, a)]
end
```

Notice how we used square brackets to represent that we are accessing a data structure instead of calling a function.

### Value Iteration

Thus, we arrive at our first algorithm: *value
iteration*. Recall from the previous discussion that
iterating the Bellman operator over any input converges
towards the optimal value function. The algorithm's main
idea follows quite straightforwardly from it: convert the
Bellman equation into an update rule to find its fixed
point.

v \gets \mathcal{B}v.

We can thus start with any initial value function v_0 and iterate the update rule above. By the magic of the Banach Fixed Point theorem, this will converge towards the optimum. This procedure repeats until the uniform error \| v - \mathcal{B}v \|_\infty becomes less than a previously set tolerance.

Each iteration of our algorithm comes from evaluating the Bellman operator in our previously chosen representation. Let's thus use our `total_cost` to write it.

```
# The Bellman operator corresponding to a decision process.
# It uses a storage representation `Values` for the value function.
function bellman_operator(prob :: Process)
    return function(v)
        Bv = Values{States}() # Empty representation
        for s in States
            Bv[s] = minimum(a -> total_cost(prob)(v, s, a), Actions(s))
        end
        return Bv
    end
end
```

Finally, the algorithm consists of iterating the
procedure generated by `bellman_operator` until it
converges to the fixed point. Afterwards, we calculate
the policy as the optimal solution for the determined
value function. The name *value iteration* comes from
the fact that only the value function takes part in the
update process, with the calculated policy playing no role.

```
function value_iteration(prob :: Process     # Data for decision process
                        ; v0 = zeros(States) # Warm start --- all zeros if you don't know any better
                        , tol)               # Stopping tolerance
    # The optimal value function is the fixed point of the Bellman operator
    v_opt = fixed_point(bellman_operator(prob); v0, tol)

    # The optimal policy is the choice of action for the total cost with the optimal value function.
    π_opt = Policy{States, Actions}()
    for s in States
        π_opt[s] = argmin(a -> total_cost(prob)(v_opt, s, a), Actions(s))
    end
    return π_opt, v_opt
end
```

The algorithm above comes from directly implementing the Fixed Point Theorem, and, because of this, is guaranteed to converge linearly to the optimum. Each iteration amounts to one evaluation of the Bellman operator, requiring \mathrm{O}(|\mathcal{S}| \cdot \max_s |\mathcal{A}(s)|) operations. Fortunately, the minimization procedures happen independently for each state, making the evaluation of \mathcal{B}v embarrassingly parallel on the states.

### In-place Value Iteration

Despite the parallelization opportunities shown by the previous implementation, it can feel too sluggish when implemented sequentially, because it waits until after traversing all states to update the value function. Another approach, better suited for a sequential machine, is to update the value function in-place in order to promptly propagate the improved information to the other states. The trade-off is that this approach is no longer able to broadcast the optimization across many processes in parallel.

Algorithmically speaking, the only required change is rewriting the fixed point iteration procedure to calculate in-place.

```
function fixed_point_inplace!(f, v; tol)
    maxerr = Inf
    while maxerr > tol
        maxerr = 0 # Start with smallest error possible
        for s in States
            prev = v[s]
            v[s] = f(v)[s]
            # Estimate ||f(v) - v||_∞ component by component
            maxerr = max(maxerr, abs(v[s] - prev))
        end
    end
    return v
end

function value_iteration_inplace!(prob, v0 = zeros(States) ; tol)
    operator(v) = Values(minimum(a -> total_cost(prob)(v, s, a), Actions(s)) for s in States)
    v_opt = fixed_point_inplace!(operator, v0 ; tol)

    π_opt = Policy{States, Actions}()
    for s in States
        π_opt[s] = argmin(a -> total_cost(prob)(v_opt, s, a), Actions(s))
    end
    return π_opt, v_opt
end
```

In the animation below, we can see in-place value iteration in action for the problem of escaping from a maze. In this model, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). The objective is to reach the bottom-right corner in the minimum possible number of steps. We do so by starting with a uniformly zero value function and a random policy. On the left we see the value function at each iteration and on the right the associated policy.

The implementations above are, indeed, just variations on the same idea: iterating the Bellman operator in order to converge to the optimal value function. There are many other tweaks we could make to it which, nevertheless, don’t affect the algorithm’s essence: choosing good warm starts, parallelizing, changing the state traversal order in the in-place version, etc. The best approach tends to be problem-dependent.
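Before moving on, it may help to see all of this run end-to-end. Below is a tiny self-contained sketch of value iteration on a made-up two-state process, using plain arrays instead of the opaque `Values`/`Policy` types: from state 1 you may `:stay` at cost 1 or `:move` to state 2 at cost 2, while state 2 loops on itself for free. Moving pays 2 once and nothing afterwards, so the optimal values are v^\star(1) = 2 and v^\star(2) = 0.

```
# A tiny two-state decision process with made-up costs.
γ = 0.9
actions(s) = s == 1 ? (:stay, :move) : (:loop,)
T(s, a)    = a == :move ? 2 : s
c(s, a)    = s == 1 ? (a == :stay ? 1.0 : 2.0) : 0.0

# Value iteration: apply the Bellman operator until the sup-norm error
# between consecutive value functions drops below tol.
function value_iteration_demo(; tol = 1e-8)
    v = zeros(2)
    while true
        Bv = [minimum(a -> c(s, a) + γ * v[T(s, a)], actions(s)) for s in 1:2]
        maximum(abs.(Bv .- v)) < tol && return Bv
        v = Bv
    end
end

v = value_iteration_demo()   # converges to [2.0, 0.0]
```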

### Policy Iteration

One issue with value iteration is that all policy
calculations are implicit, since we only ever work with value
functions. It is therefore possible to reach an optimal
policy but keep iterating the algorithm because the
value function has not converged yet. In this section,
we will get acquainted with *policy iteration*,
an algorithm that uses policies to calculate value
functions and value functions to calculate policies
until it converges to the optimum. Its selling
point is being able to directly calculate an optimal
policy in a finite number of steps.

#### Policy Evaluation

Let’s say somebody gave you a policy \pi and told you nothing more about it. How can you calculate the value function v^\pi associated with it? One way is to notice that it satisfies a recursion similar to the Bellman equation, but without the minimization step.

\begin{array}{rl} v^\pi(s) = & c(s, \pi(s)) + \gamma v^\pi(s') \\ & \quad\textrm{where}\; s' = T(s, \pi(s)). \end{array}

We can also transform this equation into a fixed point problem by defining an operator

\begin{array}{rl} (\mathcal{B}^\pi v)(s) = & c(s, \pi(s)) + \gamma v(s') \\ & \quad\textrm{where}\; s' = T(s, \pi(s)). \end{array}

The above can be readily written as a computational procedure:

```
function policy_bellman_operator(prob :: Process, π :: Policy)
    return function(v)
        Bv = Values{States}() # Empty representation
        for s in States
            Bv[s] = total_cost(prob)(v, s, π[s])
        end
        return Bv
    end
end
```

Now, we can look at \mathcal{B}^\pi as the cost for a decision process where the action set for each state was collapsed into a single option. Hence, we know that, under the same assumptions as before, it has a unique fixed point.

Moreover, turning it into an update procedure
converges towards v^\pi
for any initial value function. This way, we arrive at
an algorithm for evaluating the cost of a policy,
unimaginatively called *policy evaluation*.

```
function policy_evaluation(prob :: Process
                          , π :: Policy
                          ; v0 = zeros(States)
                          , tol)
    return fixed_point(policy_bellman_operator(prob, π); v0, tol)
end
```

Notice the similarity with value iteration. The only true difference is which operator we pass to the fixed point procedure: instead of choosing an optimal action, it just follows along with the policy. Moreover, all the discussion above on the variations of value iteration also holds for policy evaluation. You can make the same modifications with the same effects.

#### Policy Improvement

After we know a policy's value function, our next question is how to update it into a better policy. That is, how can we use this information to get nearer to the optimum?

Remember from the previous discussions that applying the Bellman operator \mathcal{B} to any non-optimal value function produces a strictly better result. It can, therefore, improve a policy’s value function.

(\mathcal{B}v^\pi)(s) = \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v^\pi(T(s, a)) \le c(s, \pi(s)) + \gamma v^\pi(T(s, \pi(s))) = v^\pi(s).

The value function \mathcal{B}v^\pi encodes the cost of choosing the best action right now while following \pi on all future steps. We can get a policy from it by taking the solution to the optimization problem.

\begin{array}{rl} \pi'(s) = \argmin\limits_{a} & c(s, a) + \gamma v^\pi(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Unless \pi was already optimal, the equation above generates a strictly
better policy. From it we can define a procedure, called
*policy_improvement*, which turns a value
function v into a
policy \pi that is
better than whatever policy v represented.

```
function policy_improvement(prob :: Process, v :: Values)
  π = Policy{States, Actions}()
  for s in States
    π[s] = argmin(a -> total_cost(prob)(v, s, a), Actions(s))
  end
  return π
end
```

#### Alternating Evaluation and Improvement

By starting from any random policy \pi_0, and alternately running policy evaluation and improvement, we generate a sequence of policies and value functions

\pi_0 \xrightarrow{\textrm{evaluation}} v^{\pi_0} \xrightarrow{\textrm{improvement}} \pi_1 \xrightarrow{\textrm{evaluation}} v^{\pi_1} \xrightarrow{\textrm{improvement}} \ldots

The value functions are monotonically decreasing while the policies strictly improve until converging to the optimum.

We thus arrive at another dynamic programming
algorithm: *policy iteration*. It consists of
iteratively taking a policy \pi, finding its value
function v^\pi through
policy evaluation and finally using policy improvement
to arrive at a better policy. Since there are only
finitely many policies, and we always obtain a strictly
better policy, this algorithm is guaranteed to converge
to an optimal policy in a finite amount of steps.

```
function policy_iteration( prob :: Process
                         ; v0 = zeros(States)
                         , tol)
  π0 = rand(Policy{States, Actions})
  v  = policy_evaluation(prob, π0; v0 = v0, tol = tol)
  π  = policy_improvement(prob, v)
  while π != π0
    π0 = π
    # Use previous v as warm start
    v = policy_evaluation(prob, π; v0 = v, tol = tol)
    π = policy_improvement(prob, v)
  end
  return π, v
end
```

Just like value iteration, policy iteration also accepts many variations on how we traverse the states. The implementation above is close to the theory and is embarrassingly parallel on both the evaluation and the improvement step. Nevertheless, it is useful to think of policy iteration more as an algorithmic principle than as an algorithm itself and adapt the steps to consider any problem-specific information that may be available.
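To see the whole loop in action, here is a hedged Python sketch of policy iteration on a hypothetical two-state process. Everything concrete (states, costs, transitions) is invented for illustration; the alternation of exact evaluation and greedy improvement is the part taken from the text.

```python
GAMMA = 0.9
STATES, ACTIONS = ("A", "B"), ("stay", "go")
COST = {("A", "stay"): 1.0, ("A", "go"): 2.0,
        ("B", "stay"): 0.0, ("B", "go"): 5.0}
STEP = {("A", "stay"): "A", ("A", "go"): "B",
        ("B", "stay"): "B", ("B", "go"): "A"}

def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate v <- B^pi v until convergence."""
    v = {s: 0.0 for s in STATES}
    while True:
        w = {s: COST[(s, policy[s])] + GAMMA * v[STEP[(s, policy[s])]]
             for s in STATES}
        if max(abs(w[s] - v[s]) for s in STATES) < tol:
            return w
        v = w

def improve(v):
    """Policy improvement: greedy argmin of the total cost under v."""
    return {s: min(ACTIONS, key=lambda a: COST[(s, a)] + GAMMA * v[STEP[(s, a)]])
            for s in STATES}

def policy_iteration():
    policy = {s: "stay" for s in STATES}
    while True:
        v = evaluate(policy)
        new = improve(v)
        if new == policy:   # stable policy: optimal
            return policy, v
        policy = new

policy, v = policy_iteration()
# Optimal here: leave "A" immediately (cost 2) and stay in the free state "B".
```

Since there are only four deterministic policies in this toy problem, the loop terminates after a couple of improvement steps, just as the finiteness argument in the text guarantees.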

### Backward Induction over a Finite Horizon

Until now, we haven’t assumed anything about the
decision process. By exploiting the problem’s state
space structure, we can turn these algorithms into much
faster ones. In this section we deal with *finite
horizon problems* and show that, for them, we can
make value iteration converge in a single iteration!

When the dynamics ends after a certain number N of time steps, we say that
the problem has a finite horizon. In this case, the
stage t is part of the
state variable (because how else would we know when to
stop?), and there is a *terminal state* \blacksquare with zero cost
representing anything that happens after the
dynamics end. Essentially, we work for N stages and then reach \blacksquare, where we can
just relax and do nothing for the rest of eternity.

If you prefer equations to figures, a finite horizon state machine is one where the transition function looks like this:

\bar{T}((t, s),\, a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases}

But what is so special about finite horizons? After all, the equation above seems much more confusing than what we had before. Well… what we gain is that the state space \mathcal{S} is clustered into smaller spaces that are visited in sequence: \mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N, \mathcal{S}_{N+1} = \{\blacksquare\}.

The *backward induction* algorithm consists of
value iteration but exploiting the structure we just
discussed. At the terminal state, the cost is always
zero, so we can set v(\blacksquare) = 0. But
this means that the *future cost* for all states
in \mathcal{S}_N is
fixed, and the optimal policy for them is just to choose
the cheapest action. Now that we know which action to
choose in \mathcal{S}_N, the future is
fixed for all states in \mathcal{S}_{N-1}, reducing
the problem in them to the routine we just did. We
repeat this procedure until we reach the first stage.
This calculates an optimal policy by induction, but
moving “backwards” in time.

In Julia code, the algorithm looks like this:

```
function backward_induction()
  v = Values{States}()
  π = Policy{States, Actions}()
  for t in N:-1:1   # walk backwards in time
    for s in States(t)
      v[s], π[s] = findmin(a -> total_cost(v, s, a), Actions(s))
    end
  end
  return π, v
end
```

I recommend comparing this with the generic
value iteration in order to see what we gain. One thing
should be clear: backward induction does the exact same
operation as value iteration for each state, but only
requires a single pass over the state space. This makes
it much more efficient. Another more subtle detail is
that since we have more structure to exploit in the
dynamics, the algorithm doesn’t have to assume so much
about the Bellman operator. For example, the
(admittedly hand-wavy) convergence proof we just gave
has no dependence on the Banach fixed-point
theorem. Thus, as a bonus, backward induction works for
any discount factor \gamma.^{7}
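To make the one-pass idea concrete, here is a Python sketch of backward induction on a small hypothetical finite-horizon problem with two states per stage, where action a moves the system to state a. All the numbers are made up for illustration.

```python
N = 2
# COST[s][a]: cost of taking action a (i.e., moving to state a) from state s.
COST = [[1.0, 4.0],
        [2.0, 1.0]]

def backward_induction():
    # v[(t, s)] = optimal cost-to-go from state s at stage t.
    v = {(N + 1, 0): 0.0, (N + 1, 1): 0.0}   # terminal stage: zero cost
    policy = {}
    for t in range(N, 0, -1):                # walk backwards in time
        for s in (0, 1):
            # The future cost at stage t+1 is already fixed,
            # so the cheapest action can be chosen greedily.
            best = min((0, 1), key=lambda a: COST[s][a] + v[(t + 1, a)])
            policy[(t, s)] = best
            v[(t, s)] = COST[s][best] + v[(t + 1, best)]
    return policy, v

policy, v = backward_induction()
# For these costs, v[(1, 0)] == 2.0 and v[(1, 1)] == 2.0.
```

A single pass over the N stages fills in the whole value function, which is exactly the gain over generic value iteration described above.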

## Nondeterminism: One Cannot Know It All

Ok, it’s time to recapitulate a bit. We started our journey looking at automata and controllable dynamics, showing how we can represent the best set of actions as those that satisfy a certain recursive relation on their value functions, called the Bellman equation. Then, we explored a couple of ways of solving this equation through methods for finding fixed points. A detail that stands out in all of that, though, is that we always had perfect information about our systems.

In the real world, there are no guarantees that taking an action will put you in a determinate state. In general, there is a myriad of possible states — perhaps all of them — that the system could transition to. Let’s model this by modifying the transition function’s type to

T : (s : \mathcal{S}) \times \mathcal{A}(s) \to M \mathcal{S}

where M converts a
type to some uncertain context.^{8} Common examples of
such operators are

- M is the power set \mathcal{P}. In this case, the transition output enumerates all possible ensuing states, defining a nondeterministic automaton.
- M takes sets to probability distributions over them. This way, we have stochastic systems with multiple possible futures. Differently from the previous example, though, in this case there is a notion of how probable each following state is.
- M is the identity. This way, M \mathcal{S}= \mathcal{S}. This serves to show that the deterministic case is also encompassed by the nondeterministic one.
- M represents actions taken by others. In many situations, such as a game, there are other players besides you who are capable of taking actions that modify the state. Since their actions also influence the outcomes, the transition can only return a function \mathcal{A}(s) \to \mathcal{S}.

To deal with nondeterministic transitions, we need some way to take a value function and aggregate it over all possible costs for the uncertain future into a single real number,

\rho : (\mathcal{S}\to \mathbb{R}) \times M \mathcal{S}\to \mathbb{R}.

The function \rho depends on which uncertainty context we’re dealing with, and we will shortly see some examples of it. Using it, we can define a Bellman operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma \rho(v, s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Quite similar to what we had before, don’t you think?
With some mild conditions on \rho, all the derivations in the
appendix for existence and
uniqueness of solutions still work in this context.
Moreover, all our algorithms — value iteration, policy
iteration, backwards induction — only used the system’s
total cost, which is still a deterministic function.
Thus we can apply them to this context *without any
need for modification*. How cool is that?
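As a sketch of how little actually changes, here is a hypothetical Python version of the nondeterministic Bellman operator, using the worst-case aggregator for power-set nondeterminism. The particular states, costs, and transitions are invented for illustration.

```python
GAMMA = 0.9

def bellman(v, cost, transition, actions, rho):
    """One application of the nondeterministic Bellman operator."""
    return {s: min(cost(s, a) + GAMMA * rho(v, transition(s, a))
                   for a in actions(s))
            for s in v}

def rho_worst(v, next_states):
    """Aggregator for power-set nondeterminism: plan for the worst future."""
    return max(v[s] for s in next_states)

# A toy process: action "either" may land anywhere, "stay" keeps the state.
v = {"A": 0.0, "B": 10.0}
cost = lambda s, a: 1.0
transition = lambda s, a: {"A", "B"} if a == "either" else {s}
actions = lambda s: ("either", "stay")

w = bellman(v, cost, transition, actions, rho_worst)
# From "A", staying gives 1 + 0.9 * 0 = 1.0, beating the worst case of "either".
```

Swapping `rho_worst` for an expectation (or any other suitable aggregator) changes the uncertainty context without touching the operator itself, which is the point of the abstraction.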

### Example: Recurrence Equations and Fibonacci

What would a dynamic programming tutorial be without calculating the good ol’ Fibonacci numbers, right? This example is actually pretty simple, and using the full power we’ve developed is real overkill. Nevertheless, as it is a rather common first example to encounter when learning dynamic programming in Computer Science classes, it is worth viewing this kind of problem in the light of the formalism we’ve developed.

A recurrence relation is any function \mathbb{N} \to \mathbb{R}, where the nth term is recursively defined through the previous terms.

f(n) = \begin{cases} c_n, & n < k \\ g(n, f(n-1), \ldots, f(n-k)), & n \ge k. \end{cases}

The c_n are constants corresponding to the base case of the recursion, while the other values f(n) depend directly on the previous k terms.

Famous examples with this structure are the factorial and Fibonacci functions:

\begin{align*} \mathrm{fat}(n) &= \begin{cases} 1, & n = 0 \\ n \cdot \mathrm{fat}(n-1), & n \ge 1. \end{cases} \\ \mathrm{fib}(n) &= \begin{cases} 0, & n = 0 \\ 1, & n = 1 \\ \mathrm{fib}(n-1) + \mathrm{fib}(n-2), & n \ge 2. \end{cases} \\ \end{align*}

In general, evaluating f(n) directly via the definition may be exponentially slow. But dynamic programming allows you to calculate it in time linear in n.

The idea is to define a nondeterministic decision process whose Bellman equation is exactly the recurrence relation. Then, the optimal value function will equal f for all values up to the N we want to evaluate. The states are the numbers 0,\ldots, N, and there is only a single dummy action \blacklozenge. For the transition, we follow the recurrence’s indices, using the power set as our source of nondeterminism. Notice that this is a finite horizon process.

T(s, \blacklozenge) = \begin{cases} \emptyset, & s < k \\ \{s-1, s-2, \ldots, s-k\}, & s \ge k. \end{cases}

For the immediate cost, we use the base cases c_n for the first k stages while the others are all zero.

c(s, \blacklozenge) = \begin{cases} c_s, & s < k \\ 0, & s \ge k. \end{cases}

And to aggregate over the set of future indices, what better choice than the relation g itself?

\rho(v, s') = g(s, \{ v(n) \mid n \in s' \}).

With this setup, this system’s Bellman equation looks a lot like the original recurrence.

v(s) = \min_{a \in \{\blacklozenge\}} \begin{cases} c_s, & s < k \\ g(s, v(s-1), \ldots, v(s-k)), & s \ge k. \end{cases}

Since there is only a single action, the minimization is redundant, and the fixed point satisfies the original recurrence. Because the problem at hand has a finite horizon, we can solve it via Value Iteration or Backwards Induction even without a discount factor \gamma.

As an example, let’s calculate the first 15 Fibonacci numbers. The following video shows the steps for value iteration.

Since the horizon is finite, we can improve it even more by using Backwards Induction! The process we’ve just constructed has a single state per stage, where we consider the initial state to be N and the final one 0 (which means going backwards, in this case). We can, therefore, use backwards induction to evaluate the Fibonacci equation in exactly N steps.
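The same construction is easy to sketch in Python: a single pass over the indices evaluates the recurrence, which is exactly backward induction on the process above. The function and parameter names here are my own.

```python
def solve_recurrence(N, base, g, k):
    """Evaluate f(0..N) for f(n) = base[n] if n < k, else g(n, f(n-1..n-k)).

    One pass over the indices: by the time we reach n, the 'future'
    values it depends on are already fixed, as in backward induction.
    """
    v = {}
    for n in range(N + 1):
        if n < k:
            v[n] = base[n]                               # base case c_n
        else:
            v[n] = g(n, [v[n - i] for i in range(1, k + 1)])
    return v

# Fibonacci: k = 2, base cases 0 and 1, and g sums the previous two terms.
fib = solve_recurrence(15, [0, 1], lambda n, prev: sum(prev), k=2)
# fib[15] == 610, computed in time linear in N instead of exponential.
```

The dictionary v plays the role of the value function; memoization and backward induction coincide here because each stage has a single state.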

### Stochastic Dynamic Programming

Being a nice applied field, dynamic programming has kept stochasticity in mind since its inception. The real world is full of uncertainty, and what better way to model not knowing the future than probability itself?

To make an automaton stochastic, the transition must
return a probability distribution over the states. In
the literature, you will find these systems under the name
*Markov Decision Process* (MDP for short). The
name comes from the fact that the new state only depends
on the current state and action, being independent of
the process’ history, just like in a Markov chain (which
is simply an MDP with a single action). A usual intuition for
this kind of process is as the interaction between an
actor and an environment. At each time step, the
environment is at a state s (which is known to the
actor), and the actor may choose among different actions
a \in \mathcal{A}(s),
each one incurring a certain cost, to interact with it.
This action affects the environment in some way that is
out of reach to the actor (thus stochastic /
non-deterministic), changing its state to s' = T(s, a). Before this
transition happens, we can just estimate what s' will be with
uncertainty. This is illustrated in the diagram below.^{9}

Allowing stochasticity opens the way for modeling a
lot more cool situations. For example, robots that play
video games! The states may be the internal state of the
game or some partial observation of them that is
available to the player and the actions are the buttons
on the controller. The transitions are internal to the
game and the costs are related to some winning/losing
factor. Have you ever heard of Volodymyr Mnih et al.^{10}’s
*Playing Atari with Deep Reinforcement Learning*
paper? In it, they use reinforcement learning to train a
robot capable of playing Atari 2600 games and all
modeling is done via Markov decision processes in a
really similar way to the discussion in this paragraph.
I really recommend checking it out.

Since we cannot know the future, the usual tool is to
take all possible outcomes and average their cost. This
gives us an aggregating function to extract a
deterministic value from a stochastic transition.^{11}

\rho(v, S) = \mathbb{E}[v(S)].

*Stochastic dynamic programming* is the field
dedicated to optimizing the expected total cost of a
Markov Decision Process over all possible
policies.

v^\star(s) = \min_\pi v^\pi(s) = \min_{a_t} \mathbb{E}\left[ \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \,\middle|\, s_0 = s,\, s_{t+1} = T(s_t, a_t) \right]

As you probably expected, the above is equivalent to a Bellman equation. Since the expectation is linear, the derivation is rather similar to the one we did for deterministic processes.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma \mathbb{E}\left[v^\star(s') \right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

For finite state spaces, this equation has the same contraction properties as the deterministic one. Hence, the tools of value and policy iteration are also readily available to solve MDPs. Their computational complexity is worse, because calculating the mean makes the update rule v \gets \mathcal{B}v require \mathrm{O}(|\mathcal{S}|^2 \cdot |\mathcal{A}|) operations.
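As a sketch, here is value iteration for a hypothetical two-state MDP in Python. The only change from the deterministic case is that the aggregator \rho is now an expectation over the transition probabilities; the particular probabilities and costs are invented for illustration.

```python
GAMMA = 0.9
STATES, ACTIONS = ("A", "B"), ("risky", "safe")
# P[(s, a)] maps each possible next state to its probability.
P = {("A", "risky"): {"A": 0.5, "B": 0.5}, ("A", "safe"): {"A": 1.0},
     ("B", "risky"): {"A": 0.5, "B": 0.5}, ("B", "safe"): {"B": 1.0}}
COST = {("A", "risky"): 0.0, ("A", "safe"): 1.0,
        ("B", "risky"): 2.0, ("B", "safe"): 3.0}

def bellman(v):
    """Stochastic Bellman operator: rho is the expectation over next states."""
    def q(s, a):
        return COST[(s, a)] + GAMMA * sum(p * v[t] for t, p in P[(s, a)].items())
    return {s: min(q(s, a) for a in ACTIONS) for s in STATES}

v = {s: 0.0 for s in STATES}
for _ in range(500):   # iterate v <- Bv; the contraction guarantees convergence
    v = bellman(v)
# Fixed point: v["A"] = 9, v["B"] = 11 ("risky" is optimal in both states).
```

Note how the inner loop is a sum over all possible next states, which is where the extra |\mathcal{S}| factor in the complexity comes from.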

## End of our Journey

Well, we finally reached the end of our overview of
dynamic programming. I sincerely hope it was as fun for
you to read as it was for me to write. And that DP gets
the honor place it deserves in your problem solving
toolkit!^{12}

Of course, a single blog post is too tiny to encompass a subject as vast as DP. There are matters of estimating the value function instead of simply calculating it, infinite state spaces, continuous time, and a plethora of other cool stuff we could do. There are also a lot of connections with reinforcement learning, for which we only scratched the surface in this post. Unfortunately, these will remain as stories for another night.

Farewell and see you next time!

## Acknowledgements

This post came to life after a series of conversations I had with Pedro Xavier. The good thing about explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also indebted to Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

## Appendix (Convergence in Infinite Horizon)

In this appendix we show that the Bellman Operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s) \end{array}

is a *monotone contraction* over the space of
bounded continuous functions.

We begin our proof with *monotonicity*. For
that, let’s introduce a partial order on the space of
value functions \mathcal{S}\to
\mathbb{R} given via uniform ordering on all
states,

v \le w \iff \forall s \in \mathcal{S},\, v(s) \le w(s).

The Bellman Operator preserves uniform ordering of value functions:

v \le w \implies \mathcal{B}v \le \mathcal{B}w.

The hypothesis v \le w implies for any state s and action a that

c(s, a) + \gamma v(T(s,a)) \le c(s, a) + \gamma w(T(s,a)).

Since this is valid for any a, taking the minimum on both sides preserves the inequality.

\min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma w(T(s, a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s).

The line above is valid for all states, thus concluding the proof.

Another important property of \mathcal{B} is that uniform translations of the input v also translate the output uniformly.

For any constant k, \mathcal{B}(v + k) = \mathcal{B}v + \gamma k.

\begin{array}{rlll} \mathcal{B}(v + k)(s) &= &\min\limits_{a} & c(s, a) + \gamma (v(s') + k) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &=& \min\limits_{a} & c(s, a) + \gamma v(s') + \gamma k\\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array}

Since the term \gamma k does not depend on the action a, we may take it out of the optimization,

\mathcal{B}(v + k)(s) = \mathcal{B}(v)(s) + \gamma k.

This concludes the theorem.

Finally, let’s prove that the Bellman operator contracts the space of bounded continuous functions by the discount factor.

The Bellman Operator contracts the space of continuous bounded functions with Lipschitz constant \gamma,

\|\mathcal{B}v - \mathcal{B}w\|_\infty \le \gamma \|v - w\|_\infty.

When \gamma < 1, it is a contraction.

From the definition of the uniform norm, we get that for any state s,

v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty.

From the monotonicity we just proved, applying \mathcal{B} to both sides preserves this inequality:

(\mathcal{B}v)(s) \le \mathcal{B}(w + \|v - w\|_\infty)(s).

And since the right-hand side above has a uniform translation, we can take the constant out:

\begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned}

Using that the norm is symmetric, we can do the same derivation in the opposite direction (for w - v) to get an inequality for the absolute value. Finally, applying the supremum, it becomes the result we want.

\begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned}

Therefore, from the Banach fixed point theorem and the above, we conclude that whenever \gamma < 1, the operator \mathcal{B} has a unique fixed point. Hence, any decision process with a discount factor is solvable and has a unique optimal value function v^\star.
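This contraction bound is easy to check numerically. The sketch below builds a random, entirely hypothetical finite process in Python and verifies the inequality for a pair of random value functions.

```python
import random

random.seed(0)
GAMMA = 0.9
S, A = 5, 3   # a random process: 5 states, 3 actions
cost = [[random.random() for _ in range(A)] for _ in range(S)]
step = [[random.randrange(S) for _ in range(A)] for _ in range(S)]

def bellman(v):
    """Deterministic Bellman operator on a value function given as a list."""
    return [min(cost[s][a] + GAMMA * v[step[s][a]] for a in range(A))
            for s in range(S)]

def dist(v, w):
    """Uniform (sup) distance between two value functions."""
    return max(abs(x - y) for x, y in zip(v, w))

v = [random.uniform(-10.0, 10.0) for _ in range(S)]
w = [random.uniform(-10.0, 10.0) for _ in range(S)]

# The theorem guarantees ||Bv - Bw|| <= gamma * ||v - w|| for *any* v, w.
assert dist(bellman(v), bellman(w)) <= GAMMA * dist(v, w) + 1e-12
```

The small additive slack only guards against floating-point rounding; the mathematical inequality itself is exact.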

To be more precise, we work with hydrothermal dispatch problems, where one must decide between many sources of energy (hydro, thermal, renewable) to supply a certain power demand, taking into account the uncertainties of the future. For example: hydro is cheap and clean, but you risk running out of water if you use all of it and next month turns out particularly dry. Finding the best energy dispatch is once again solved via dynamic programming.↩︎

Even Richard Bellman admittedly named it based on how cool it sounds.↩︎

*Dynamic Programming*, Princeton Landmarks in Mathematics and Physics (Princeton, NJ: Princeton University Press, 1957), ch. 3, p. 83.↩︎

It is out of scope more because it is a tangent to the topic than because of any difficulty in the proof. If you are interested in analysis, I really recommend you try proving it. The main idea is to use the contraction property to show that the distance between the iterates of f must converge towards zero.↩︎

They are even called *functions* in most programming languages.↩︎

You may wonder if this means that dynamic programming also works for decision processes whose costs satisfy other kinds of Bellman equations (non-additive costs, for example), and the answer is yes! A great place to start is Dimitri P. Bertsekas, *Abstract Dynamic Programming* (Belmont, Mass: Athena Scientific, 2013). Just be warned that there are a lot more technicalities to deal with in those cases.↩︎

In contrast with value iteration, it also has no dependence on the costs being real numbers. In this case, any closed semiring would do. But I digress… This is out of scope for this post.↩︎

The technical definition is that M should be a Monad. But discussing those details is out of scope here. See this other post on automata with contexts for a discussion of this structure in a similar context.↩︎

Loosely based on the diagram found in Richard S. Sutton and Andrew G. Barto, *Reinforcement Learning: An Introduction*, Second edition, Adaptive Computation and Machine Learning Series (Cambridge, Massachusetts: The MIT Press, 2018).↩︎

Volodymyr Mnih et al., “Playing Atari with Deep Reinforcement Learning,” December 2013, https://doi.org/10.48550/ARXIV.1312.5602.↩︎

Although the average is the most common choice, it is far from the only one. Many real-life situations call for risk-aversion and dynamic programming with coherent risk measures works equally well.↩︎

In fact, fixed points and recursion as a whole deserve the spot. They’re everywhere!↩︎