25 June 2022

What if I told you that some of the most used algorithms to find the
shortest path in a graph, calculate gradients while training a neural
network, and parse context-free grammars are essentially implementations
of the same principle? It is called *dynamic programming* and is
one of those instances in mathematics where a simple principle unfolds
into profound conclusions ranging over many fields. In fact, we can,
already in this first paragraph, summarize the idea using the words of
Richard Bellman, dynamic programming’s creator, himself:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I gotta say that I was taught dynamic programming in different
contexts, but it took me a while to finally get the “click” that they
were actually the same thing. When learning algorithms and data
structures, it was a memoization-based technique where you speed up your
algorithm by first solving the easier parts and saving them for later
use. At work, I mostly deal with solving a lot of linear programs for
long-term scheduling problems.^{1} The main algorithm we
use, called *Stochastic Dual Dynamic Programming*, at first
didn’t seem much like the programming technique from algorithms class.
Finally, one of the main methods for model-based reinforcement learning
is again called dynamic programming, and it also didn’t seem so much
like the other instances.

So, what’s happening here? Did everybody choose to call their
algorithms dynamic programming just because it’s a cool name?^{2} Well, in fact there are some
principles that apply to all of those instances, from planning a
rocket’s trajectory to TeX’s word-wrapping. And the list
goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let’s start with something I am really fond of: old school platformer games. In our hypothetical game, which is definitely not about some Italian plumber, the character stands idle by default. But with the press of a button on the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activates the respective animation on the screen. In the best Resident Evil style, this game only allows the character to shoot while idle and also forces it to return to idle after a jump before doing any other action. Think of that as the time it takes to restore one’s balance after falling. This description may seem overly complicated in text, but fortunately the nice folks in the Comp Sci department have already invented diagrams that show these transitions nicely.

Our modeling above is an instance of something called a *state
machine*, or *automaton* if you’re into Greek words. There are
4 states in which the character might be, and at each one there is an
available set of actions that transition the character to another state. More
abstractly, an automaton is a system that can be in one of many
*states* s \in \mathcal{S} and,
at each state, you can choose among a set of *actions* a \in \mathcal{A}(s) that transition the
system to a new state s' = T(s, a),
where T is called the system’s
*transition function*.

An important note: if you come from Computer Science, you are
probably used to *finite* state machines. Just know that in our
context, the states may be any set. Some of the algorithms that we will
see today only work for finite state spaces, but there are others that
may even require a continuous space! An example is Stochastic Dual
Dynamic Programming, which is based upon linear programming duality and
thus requires the state space to be a convex subset of \mathbb{R}^n.

Iterating the transition T establishes a dynamics for our system where we start at an initial state s and by taking a sequence of actions a_1, a_2, \ldots we walk over the state space.

\begin{aligned} s_1 &= s, \\ s_{t+1} &= T(s_t, a_t). \end{aligned}
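In code, iterating the dynamics is just a fold over the action sequence. Below is a minimal Python sketch; the toy state space, transition function and action sequence are invented purely for illustration:

```python
# A toy automaton: states are integers and the two available actions
# either increment the state or double it. (Both the transition
# function and the action sequence below are made-up examples.)
def transition(state, action):
    if action == "+1":
        return state + 1
    if action == "*2":
        return state * 2
    raise ValueError(f"unknown action: {action}")

def run(initial_state, actions):
    """Iterate the transition, returning the visited states s_1, s_2, ..."""
    states = [initial_state]
    for a in actions:
        states.append(transition(states[-1], a))
    return states

print(run(1, ["+1", "*2", "*2", "+1"]))  # [1, 2, 4, 8, 9]
```

Note that the trajectory is fully determined by the initial state and the chosen actions; that determinism is exactly what we will relax later when discussing stochastic transitions.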

This new view makes our state machine somewhat equivalent to a
*controllable dynamical system*, which is another really cool
name in my opinion.

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your
system in order to choose an action, no matter the previous history nor
time step. Indeed, if those things affect your choice, then you can
without loss of generality model the problem as a larger system where
the state also carries this information. Thus controlling a dynamic
system amounts to a function \pi : (s :
\mathcal{S}) \to \mathcal{A}(s) which chooses a valid action for
each state. In the literature this is called a *policy*, in analogy
to a government taking actions to run its state.

Unfortunately life is not known for its free lunches and in most systems, whenever we take action a at state s, there is a certain cost c(s, a) to pay. Depending on the context this can be, for example, a real monetary cost (for economic problems), some total distance (for planning) or even a negative cost representing a reward.

Thus, by following a policy \pi, we
produce a sequence of costs c(s_t,
\pi(s_t)), one for each time step. We could define the total cost for
\pi as the sum of those costs, but
there is an additional detail to notice. If I gave you something and
asked whether you want to pay me today or next year, which option would
you prefer? Sometimes there are factors such as inflation or interests
that make costs in the future not have the same actual value as the
costs we expend right now. This prompts us to introduce a problem
dependent *discount factor* \gamma \in
[0, 1] such that the total cost for a policy \pi is

\begin{array}{rl} v^\pi(s) = & c(s_1, \pi(s_1)) + \gamma c(s_2, \pi(s_2)) + \gamma^2 c(s_3, \pi(s_3)) + \ldots \\ \textrm{where} & s_1 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array}

The equation above defines the *value function* v^\pi : \mathcal{S}\to \mathbb{R} for a given
policy \pi. **Spoiler**:
keep an eye on the v^\pi, because later
in this post we will find them to be useful tools closely related to the
memoization techniques that people usually identify with dynamic
programming.

Besides its practical interpretation, the discount factor \gamma also plays a significant role from the mathematical point of view. If |\gamma| < 1 and the costs are uniformly bounded (which is the case for a finite action space, for example) we can guarantee that the series defining v^\pi converges for any policy and initial state. That is, suppose that there exists M > 0 such that

\forall s \in \mathcal{S}, a \in \mathcal{A}(s),\, |c(s, a)| \le M.

This bounds the total cost by a geometric series that cannot blow up,

\sum\limits_{t=1}^\infty \gamma^{t-1}|c(s_t, \pi(s_t))| \le \sum\limits_{t=1}^\infty \gamma^{t-1} M \le \frac{M}{1 - \gamma},

thus guaranteeing that the value function is well-defined.
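As a sanity check, we can verify this bound numerically. The Python sketch below uses an arbitrary bounded cost sequence and \gamma = 0.9 (both invented for illustration) and confirms that the discounted sum of absolute costs stays below M / (1 - \gamma):

```python
# Numerically check the geometric bound on the discounted total cost.
# The cost sequence is arbitrary, subject only to |c| <= M.
gamma, M = 0.9, 2.0
costs = [2.0, -1.5, 2.0, 0.5] * 250  # a long sequence with |c| <= M

total = sum(gamma**(t - 1) * abs(c) for t, c in enumerate(costs, start=1))
bound = M / (1 - gamma)  # = 20.0
print(total <= bound)  # True
```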

Having multiple courses of action possible prompts us to ask which is
the best one possible. When programming a robot to escape a labyrinth,
you want it to take the least amount of time. When controlling a
spaceship towards the moon, it is important to guarantee that it will
use the least amount of fuel. When brawling at a bar, you want to knock
out your foe with the fewest injuries possible. Most of all, the best
policy is the one with the least cost taking *all time* into
account; both the present and its future consequences. For example,
sometimes a policy that has a higher cost for the first state is overall
better because it puts us into a more favorable state. Thus, our problem
can be naturally formulated as searching for *optimal
policies*:

Starting at state s, find a policy \pi producing the least total cost over time.

Or equivalently in math language:

\begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure that we can exploit in order to solve it. This will be the subject of the next section. Before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision making framework.

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process description is quite straightforward.

- **States**: nodes in the graph.
- **Actions** at state s: edges going from s to another node.
- **Transition**: the opposite node on the same edge. That is, given an edge s \to s', T(s, s \to s') = s'.
- **Costs**: c(s, a) is the weight of edge a.

Finding the shortest path from s to node z is the same as setting the initial state to s and making z a terminal state of our dynamics.
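For concreteness, here is how this translation might look in Python for a tiny invented road network (the city names and distances are made up). Since this particular graph is acyclic, we can even get the shortest distance already by brute-force recursion over the actions:

```python
# A tiny weighted graph: edges[s] lists (neighbor, distance) actions.
# City names and distances are invented for illustration.
edges = {
    "home":     [("la_paz", 300.0), ("arequipa", 450.0)],
    "la_paz":   [("cuzco", 650.0)],
    "arequipa": [("cuzco", 350.0)],
    "cuzco":    [],  # terminal state: no available actions
}

def actions(s):
    return edges[s]   # actions at s are its outgoing edges

def T(s, a):
    return a[0]       # transition to the opposite node of the edge

def cost(s, a):
    return a[1]       # the cost of an action is the edge's weight

def shortest(s):
    """Brute-force recursion over all action sequences starting at s."""
    if not actions(s):
        return 0.0
    return min(cost(s, a) + shortest(T(s, a)) for a in actions(s))

print(shortest("home"))  # 800.0, going through arequipa
```

On a general graph with cycles this naive recursion would not terminate; taming that is precisely what the algorithms later in the post are for.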

Alright, it’s finally time to solve those decision problems. The simplest idea could be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive since the possible candidates grow exponentially with the time steps. Any practical method will take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous *Bellman principle of
optimality*, which is the cornerstone of dynamic programming. Taking
Richard E Bellman^{3}’s own words, it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is telling us is that in order to calculate an optimal policy, we should turn this iterative process of making actions and calculating costs into a recursive procedure. That is, taking an action puts us into a new state s_2 = T(s_1, a_1) where we are again faced with the exact same problem of finding an optimal policy but this time starting at s_2. Let’s see how we can exploit this idea.

Remember that we defined the value function v^\pi as the total cost of following a policy
\pi when starting at a given state.
Let’s define the *optimal value function* v^\star as the total cost of choosing the
best course of action while starting at a certain state s.

\begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Notice in the optimization problem above that the initial state is
only ever used to choose the first action. Later actions do not depend
directly on it but only on its consequences. This means that we can
break the problem into two parts: calculating an *immediate cost*
dependent on the initial state and calculating a future cost dependent
on all next states.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_2 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

There’s already some recursive structure unfolding in here! What is still missing consists of noticing that since the sum in the future cost starts at t = 2, we can factor out \gamma. By renaming l = t - 1 (and relabeling the state and action sequences accordingly) we get

\sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{t=2}^\infty \gamma^{t-2}c(s_t, a_t) = \gamma \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l),

and applying this in the expression for v^\star,

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l) \\ \textrm{s.t.} & s_1 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Although this is a huge expression, it should be straightforward to
see that the expression for the future cost is *exactly* the
optimal value v^\star(s') of
starting the dynamics at s' = T(s,
a). This way, the principle of optimality expresses itself
mathematically as a recursive equation that the value for an optimal
policy must satisfy.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

This is called the *Bellman equation*, and all of dynamic
programming consists of methods for solving it. Even more: we can think
of the Bellman equation as a recursive specification for the decision
problems and of dynamic programming as any problem-specific
implementation that solves it.
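When the dynamics never revisits a state (for example, on a finite horizon), the Bellman equation can be solved directly by memoized recursion, which is exactly the flavor of dynamic programming taught in algorithms courses. Here is a Python sketch over a made-up three-stage problem (the states, actions and costs are all invented for illustration):

```python
from functools import lru_cache

# A made-up finite-horizon problem: states are (stage, position) pairs
# and the dynamics stops after stage N. All choices are illustrative.
GAMMA = 1.0
N = 3

def actions(state):
    stage, _ = state
    return [] if stage > N else [-1, +1]  # move left or right

def T(state, a):
    stage, pos = state
    return (stage + 1, pos + a)

def cost(state, a):
    _, pos = state
    return abs(pos + a)  # penalize drifting away from the origin

@lru_cache(maxsize=None)
def v_star(state):
    """The Bellman equation, solved by memoized recursion."""
    if not actions(state):
        return 0.0  # terminal: nothing left to pay
    return min(cost(state, a) + GAMMA * v_star(T(state, a))
               for a in actions(state))

print(v_star((1, 0)))  # 2.0
```

The `lru_cache` decorator is the memoization: each state's value is computed once and reused, instead of re-expanding the whole tree of action sequences.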

It is time to get deeper into analysis. Whenever mathematicians see a recursive relation such as the Bellman equation, they immediately start asking questions: what guarantees do we have about v^\star? Can I trust that it is unique? Does it even exist? Surely we mathematicians may seem a bit too anxious with all these questions, but there are good reasons for them. Besides the guarantee that everything works, proving the existence of a solution in many cases also teaches us how to construct it. In fact, this will be just the case here! In the next section we are going to adapt the theorems in here into algorithms to solve the Bellman equation.

Recursion has a deep relationship with fixed points. For example, we
can recast the Bellman equation as a fixed-point equation for an operator \mathcal{B}: (\mathcal{S}\to \mathbb{R}) \to
(\mathcal{S}\to \mathbb{R}) called, as you can guess, the
*Bellman Operator*. It takes value functions to value functions
and is defined as

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

If you are not accustomed to fixed points, the transition above from the Bellman equation to the Bellman operator may seem strange. Hence, let’s think about a little story in order to develop some intuition.

Imagine that you are the king/queen of a fantastical kingdom. You are an absolute monarch and whatever action you decide to take, your subjects will follow. Lately, the kingdom’s reserves are running dry and your counselors advised you to define a clear governing policy in order to minimize the kingdom’s spending. Since, besides being a ruthless ruler, you’re also a great mathematician and a fan of my blog, at this point you already know what must be done to save your kingdom from bankruptcy: solve the Bellman equation in order to find its optimal policy.

Because at this point of the post you still don’t know how to solve it, you decide to take advantage of your fantastical world and hire a wizard to look into his crystal ball and act as an oracle, telling you how much each possible state will cost the kingdom in the future. Beware, though, that it is not wise to blindly follow advice from a crystal ball. The right thing to do in this situation is to use this oracle to decide what to do at each state. Now, instead of solving the Bellman equation, all you have to do to get a policy is to solve the optimization problem with the future cost given by the wizard’s predictions. The process of going from prediction to decision is precisely the Bellman operator. The function \mathcal{B}v is the cost of choosing the best action according to the prediction.

The optimal value function, furthermore, is the one with the correct prediction, and is, thus, kept unchanged by the Bellman operator:

\mathcal{B}v^\star = v^\star.

We thus reduce the question of existence and uniqueness of solutions
for the Bellman equation to finding fixed points of \mathcal{B}. Fortunately there is a theorem
that does just that, called the *Banach Fixed Point* theorem!

Let (M,\mathrm{d}) be a complete metric space and f : M \to M a continuous function such that there is a \gamma \in (0,1) satisfying

\mathrm{d}(f(x), f(y)) \le \gamma \mathrm{d}(x, y)

for any x, y \in M. Then f has a unique fixed point x^\star and for any initial value x_0, iterating f will eventually converge towards x^\star,

\lim_{n \to \infty} f^n(x_0) = x^\star.

Proving this theorem is out of scope for this post^{4}.
However, we can think of it as saying that if a mapping shrinks all
distances, then eventually the image of all points will be squeezed into
a single point.
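A tiny Python illustration: the map f(x) = x/2 + 1 shrinks every distance by a factor of 1/2 (so \gamma = 1/2 in the theorem), and iterating it from any starting point converges to its unique fixed point x^\star = 2. The map itself is, of course, just an arbitrary example:

```python
# Iterating the contraction f(x) = x/2 + 1, whose unique fixed point
# is x* = 2: indeed |f(x) - f(y)| = |x - y| / 2.
def f(x):
    return x / 2 + 1

x = 100.0  # any starting point works
for _ in range(60):
    x = f(x)

print(x)  # 2.0 (up to floating point)
```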

To apply the Banach fixed point theorem to the Bellman operator, we need to find a metric space where \mathcal{B} is a contraction. A suitable space is that of bounded continuous functions over the states, C^0_b(\mathcal{S}, \mathbb{R}), equipped with the uniform norm

\|f\|_\infty = \sup_{s \in \mathcal{S}} |f(s)|.

This is a complete metric space and is general enough for any
practical problem we may think of. Furthermore, recall that if the state
space is discrete, then *any function is continuous and bounded*,
a hint that this encompasses the most important spaces from a
computational perspective.

Since I don’t want to depart too much from this post’s main topic nor dive into mathematical minutiae, this section only states the important theorem. In case you are interested, the necessary proofs are in an appendix for completeness.

The Bellman operator is a continuous contraction whenever the discount factor is \gamma < 1. Thus, it has a unique fixed point v^\star and no matter the initial choice for the value function v_0,

\mathcal{B}^{(n)} v_0 \xrightarrow{n \to \infty} v^\star.

Moreover, we can estimate the distance between the nth iterate v_n and the optimum through:

\mathrm{d}(v_n, v^\star) \le \frac{\gamma^n}{1 - \gamma} \mathrm{d}(v_0, \mathcal{B}v_0).
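This estimate also tells us, before running anything, how many iterations suffice for a given tolerance: just solve \gamma^n / (1 - \gamma) \, \mathrm{d}(v_0, \mathcal{B}v_0) \le \mathrm{tol} for n. A quick Python sketch, with arbitrary example numbers:

```python
import math

def iterations_needed(gamma, d0, tol):
    """Smallest n with gamma^n / (1 - gamma) * d0 <= tol."""
    if d0 <= tol * (1 - gamma):
        return 0
    return math.ceil(math.log(tol * (1 - gamma) / d0) / math.log(gamma))

# With gamma = 0.9 and an initial error of 1, reaching tol = 1e-6
# takes on the order of 150 iterations:
print(iterations_needed(0.9, 1.0, 1e-6))  # 153
```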

Besides the existence and uniqueness guarantees, this theorem also spells out something of a more practical character. No matter which cost estimate we start with, we can use the Bellman operator to update our value function until it converges towards the optimum. This is the next section’s topic.

For this section, let’s assume that both the state space \mathcal{S} and the action spaces \mathcal{A}(s) are finite. This will allow us to focus on exhaustive methods exploring the entire state space. Keep calm, however: I am planning to write other posts in the future explaining how these ideas generalize to continuous spaces, or even to finite spaces that are too huge to explore entirely, via something called Approximate Dynamic Programming or Reinforcement Learning. But that is a story for another night…

From the previous discussion, we learned that iterating the Bellman
operator converges towards the optimal value function.
Thus, we arrive at our first algorithm: *value iteration*. Its
main idea is actually quite simple: convert the Bellman equation into
an update rule.

v \gets \mathcal{B}v.

We can thus start with an initial value function v_0 and iterate the update rule above. By the magic of the Banach Fixed Point theorem, this will converge towards the optimal value function no matter what the initial value function is. This procedure repeats until the uniform error \| v - \mathcal{B}v \|_\infty becomes less than a previously set tolerance (for which we have an estimate of necessary iterations).

For efficiency reasons, it is customary to represent the value function not as a function but using some data structure (usually a vector). The memoization that people associate with dynamic programming lies entirely in this “trick”. However, it is good to keep in mind that this is only a matter of computational representation of functions and is totally orthogonal to the algorithm’s design. In a language with first-class functions, it is possible to do dynamic programming using only function composition. It just might not be as fast as one would like.

In value iteration, we obtain an optimal policy from the value function by keeping track of the \argmin whenever we solve an optimization problem. Below we see a Julia implementation of value iteration.

```
# The total cost of choosing action a at state s,
# given a future cost estimate v.
function total_cost(v, s, a)
    cost(s, a) + γ*v[next(s, a)]
end

function value_iteration(v0; tol)
    v = copy(v0)
    π = Policy{States, Actions}()
    maxerr = Inf
    while maxerr > tol
        maxerr = 0
        for s in States
            prev = v[s]
            v[s], π[s] = findmin(a -> total_cost(v, s, a), Actions(s))
            maxerr = max(maxerr, abs(v[s] - prev))
        end
    end
    return π, v
end
```

In the animation below, we can see value iteration in action for the problem of escaping from a maze. In this model, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). The objective is to reach the bottom-right corner in the minimum number of steps possible. We do so by starting with a uniformly zero value function and a random policy. On the left we see the value function at each iteration, and on the right the associated policy.

The algorithm above is in fact just one variation of value iteration. There are still many problem-dependent improvements one can make. For example, we chose to update v in-place, already propagating the new value function while traversing the states, but we could instead have kept the old value function and only updated v after traversing all states. Our approach has the advantage of using the improved information as soon as it is available but updating in batch may be interesting when we’re able to broadcast the optimization across many processes in parallel.

Another important choice we have is the initial value function. Choosing a good warm start can greatly improve the convergence. As an example, whenever there is a terminal state \blacksquare, it is a good idea to already fill in v(\blacksquare) = 0. Finally, the order in which we traverse the states matters. There is a reason why dynamic programming is famous for solving problems backwards. If we know that a given state is easier to solve, we should start the traversal there. This is a specialization that we will further explore in a bit.

Well, we have a lot of options… Nevertheless, as long as we keep visiting all states, any of those approaches is guaranteed to converge towards the optimal value function; which one is faster is generally problem-dependent. This is why I think it is best to think of DP not as an algorithm but as a principle that encompasses many similar algorithms.

Let’s take look at a typical scenario where we can exploit the state
space structure to make value iteration much faster: problems with a
*fixed finite horizon*.

When the dynamics ends after a certain number N of time steps, we say that the problem has
a finite horizon. In this case, the stage t is part of the state variable (because how
else would we know when to stop?) and there is a *terminal state*
\blacksquare with zero cost and
representing anything that happens after the dynamics end. Essentially,
we work for N stages and then reach
\blacksquare, where we can just relax
and do nothing for the rest of eternity.

If you prefer equations over figures, a finite horizon state machine is one where the transition function looks like this:

\bar{T}((t, s),\, a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases}
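In code, \bar{T} is a small wrapper around the original transition. A Python sketch, where the underlying dynamics T and the horizon N are mere placeholders:

```python
N = 3
TERMINAL = "terminal"  # stands in for the ■ state

def T(s, a):
    return s + a  # placeholder for the underlying dynamics

def T_bar(ts, a):
    """Finite-horizon transition over the augmented states (t, s)."""
    t, s = ts
    if t < N:
        return (t + 1, T(s, a))
    return TERMINAL  # after stage N, everything collapses into ■

print(T_bar((1, 0), 5))  # (2, 5)
print(T_bar((3, 7), 5))  # terminal
```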

But what is so special about finite horizons? After all, the equation above seems much more confusing than what we had before. Well… what we gain is that the state space \mathcal{S} is clustered into smaller spaces that are visited in sequence: \mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N, \mathcal{S}_{N+1} = \{\blacksquare\}.

The *backward induction* algorithm consists of value iteration
but exploiting the structure we just discussed. At the terminal state,
the cost is always zero, so we can set v(\blacksquare) = 0. But this means that the
*future cost* for all states in \mathcal{S}_N is fixed, and the optimal
policy for them is just to choose the cheapest action. Now that we know
which action to choose in \mathcal{S}_N, the future is fixed for all
states in \mathcal{S}_{N-1}, reducing
the problem in them to the routine we just did. We repeat this procedure
until we reach the first stage. This calculates an optimal policy by
induction, but moving “backwards” in time.

In Julia code, the algorithm looks like this:

```
function backward_induction()
    v = Values{States}()
    π = Policy{States, Actions}()
    for t in N:-1:1   # traverse the stages backwards in time
        for s in States(t)
            v[s], π[s] = findmin(a -> total_cost(v, s, a), Actions(s))
        end
    end
    return π, v
end
```

I recommend comparing this with the generic value iteration in
order to see what we gain. One thing should be clear: backward induction
does the exact same operation as value iteration for each state, but
only requires a single pass over the state space. This makes it much
more efficient. Another more subtle detail is that since we have more
structure to exploit in the dynamics, the algorithm doesn’t have to
assume so much about the Bellman operator. For example, the (admittedly
hand-wavy) convergence proof we just gave has no
dependence on the Banach fixed-point theorem. Thus, as a bonus, backward
induction works for any discount factor \gamma.^{5}

One issue with value iteration is that all policy calculations are implicit, since we just work with value functions. It is then possible that we reach an optimal policy and keep iterating the algorithm to improve the estimate for the value function. In this section, let’s see how we can directly calculate an optimal policy in a finite number of steps.

Our next question is then how to calculate the cost associated with a policy. Let’s say somebody gave you a policy \pi and told you nothing more about it. How can you find its value function v^\pi? One way is to notice that it satisfies a recursion similar to the Bellman equation, but without the minimization step.

v^\pi(s) = c(s, \pi(s)) + \gamma v^\pi(T(s, \pi(s))).

If you follow the same steps we did before transforming this equation
into a fixed point problem, you will see that under the same assumptions
of continuity (always valid for finite state and action spaces) it also
has a unique solution. Moreover, turning it into an update procedure
converges towards v^\pi for any initial
value function. This way, we arrive at an algorithm for evaluating the
cost of a policy, unimaginatively called *policy evaluation*.

```
function policy_evaluation(π, v0=zeros(States); tol)
    v = copy(v0)
    maxerr = Inf
    while maxerr > tol
        maxerr = 0
        for s in States
            prev = v[s]
            v[s] = total_cost(v, s, π[s])  # same total_cost as in value iteration
            maxerr = max(maxerr, abs(v[s] - prev))
        end
    end
    return v
end
```

Notice the similarity with value iteration. The only difference is on the update rule: instead of choosing an optimal action, we just follow along with the policy.

After we know a policy and its value function, our next question is how to improve it. That is, how can we use this information to get nearer to an optimal policy?

The secret lies in locally improving our policy for each state.
Consider a state s. The value v^\pi(s) is the total cost of starting at
s and following \pi thereafter. What if there is an a \in \mathcal{A}(s) such that choosing a at the first step and following \pi thereafter is *cheaper* than just
following \pi?

c(s, a) + \gamma v^\pi(T(s, a)) < v^\pi(s).

Since we have this improvement at the first step and our processes are assumed to not depend on anything besides what is represented on the state s, then it must be better to choose a than \pi(s) whenever we are at state s. That is, if we define a new policy

\lambda(x) = \begin{cases} a, & x = s \\ \pi(x), & \text{otherwise} \end{cases}

it is always better to follow \lambda than \pi, because v^\lambda \le v^\pi.

We thus arrive at our next algorithm: *policy iteration*. It
consists of taking a policy \pi,
finding its value function v^\pi
through policy evaluation and finally using policy improvement to arrive
at a better policy. Since there are only finitely many policies, and we
always obtain a strictly better policy, this algorithm is guaranteed to
converge to an optimal policy in a finite amount of steps.

```
function policy_improvement(π0, v)
    π = copy(π0)
    for s in States
        π[s] = argmin(a -> total_cost(v, s, a), Actions(s))
    end
    return π
end

function policy_iteration(π, v=zeros(States); tol)
    while true
        π0 = π
        v = policy_evaluation(π, v; tol=tol)
        π = policy_improvement(π, v)
        if π == π0 break end
    end
    return π, v
end
```

Did you notice that we iteratively alternate between two kinds of passes over the states? This is the mother of all backwards-forwards algorithms people generally associate with dynamic programming.

Just like with value iteration, there’s also a lot of freedom in how we traverse the states. Again it is useful to think of policy iteration more as a principle than as an algorithm in itself and adapt the steps to consider any problem specific information that may be available.

Ok, it’s time to recapitulate a bit. We started our journey looking at automata and controllable dynamics, showing how we can characterize the best actions as those that fulfill a certain recursive relation on their value functions, called the Bellman equation. Then, we explored a couple of ways of solving this equation through methods for finding fixed points. Now, a question for the reader: how much do you think those methods actually depended on the Bellman equation itself?

If you go back to our algorithms for *value iteration*,
*backwards induction*, or *policy iteration*, you will
notice that the Bellman operator itself only appears in a single line.
Furthermore, this appearance is completely encapsulated. The algorithms
do not depend on the Bellman operator per se; they work because it
satisfies a couple of reasonable properties!

Very well, the mathematician inside me sees something like that and starts thinking about generalizing. Let’s see where a bit more abstraction leads us.

Value iteration is simply Banach’s fixed-point iteration in disguise. Hence, it
is a tool capable of solving any problem described by a recursive
equation whose operator is a monotone contraction. Similarly, we can
use policy iteration whenever the operator associated with following a
policy behaves as a monotone contraction. To be fair, one could go
*even further* and substitute the hypotheses of Banach’s
theorem for those of any other theorem guaranteeing an attractor among the value
functions. As this can get rather technical, we will not explore it
here, but if you are curious (and a big fan of abstract books), a great
place to start is Dimitri P. Bertsekas^{6}’s book.
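To see how little the iteration itself cares about Bellman operators, here is a minimal sketch of a generic fixed-point solver. It is in Python rather than the post’s Julia, and the names `fixed_point` and `op` are made up for illustration; the only assumption is that `op` is a contraction.

```python
def fixed_point(op, x0, tol=1e-8, max_iter=10_000):
    """Iterate x <- op(x) until successive iterates are tol-close.

    If `op` is a contraction, Banach's theorem guarantees convergence
    to its unique fixed point from any starting guess x0.
    """
    x = x0
    for _ in range(max_iter):
        x_new = op(x)
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter")

# Example: an affine contraction with Lipschitz constant 0.5,
# whose unique fixed point is x = 2.
contraction = lambda v: [0.5 * x + 1.0 for x in v]
print(fixed_point(contraction, [0.0]))  # ≈ [2.0]
```

Value iteration is exactly this loop with `op` set to the Bellman operator.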

Alright, this post is becoming too abstract for my taste. Let’s dive into a couple examples to see why this generalization is cool.

Until now, we’ve only dealt with deterministic processes. Life, on the other hand, is full of uncertainty and, dynamic programming being a nice applied field, it has kept stochasticity in mind since its inception.

We call a state machine whose transitions T(s, a) are stochastic a *Markov Decision
Process* (MDP for short). The name comes from the fact that the new
state depends only on the current state and action, being independent of
the process’ history; just like a Markov chain. A usual intuition for
this kind of process is as the interaction between an actor and an
environment. At each time step, the environment is at a state s (which is known to the actor), and the
actor may choose among different actions a \in
\mathcal{A}(s) to interact with it, each one incurring a certain cost^{7}.
This action affects the environment in some way that
is out of reach to the actor (thus stochastic / non-deterministic),
changing its state to s' = T(s, a).
This is illustrated in the diagram below.^{8}

Allowing non-determinism opens the way for modeling a lot more cool
situations. For example, robots that play video games! The states may be
the internal state of the game, or some partial observation of it that
is available to the player, and the actions are the buttons on the
controller. The transitions are internal to the game and the costs are
related to some winning/losing factor. Have you ever heard of Volodymyr Mnih et al.^{9}’s *Playing Atari with Deep
Reinforcement Learning* paper? In it, they use reinforcement
learning to train a robot capable of playing Atari 2600 games, and all
the modeling is done via Markov decision processes in a way that is really
similar to the discussion in this paragraph. I really recommend checking
it out.

With a stochastic environment, we can’t necessarily predict the total costs and transitions, only estimate them. To accommodate that, we will have to change a few things in our definition of policies and value functions.

Even if we fix a policy \pi, we cannot be certain of what is going to happen, because the costs and transitions are stochastic. We are, however, able to calculate the average cost by taking the expected value among all possible outcomes, conditioned on following the policy \pi.

v^\pi(s) = \mathbb{E}\left[ \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \middle| s_1 = s,\, s_{t+1} = T(s_t, a_t),\, a_t = \pi(s_t) \right]
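This expectation can also be approximated directly by simulation: average the discounted cost over many sampled trajectories. Below is a minimal sketch in Python (the post’s own code is Julia) over a made-up two-state MDP; every name in it is illustrative.

```python
import random

random.seed(0)
GAMMA = 0.9

# Made-up MDP: state 0 costs 1 per step and moves to the absorbing,
# zero-cost state 1 with probability 0.5; there is a single dummy action.
def transition(s, a):
    return 1 if s == 1 or random.random() < 0.5 else 0

def cost(s, a):
    return 1.0 if s == 0 else 0.0

def estimate_value(policy, s, episodes=5_000, horizon=100):
    """Monte Carlo estimate of v^pi(s): the empirical mean of the
    discounted cumulative cost over simulated trajectories."""
    total = 0.0
    for _ in range(episodes):
        state, discount, acc = s, 1.0, 0.0
        for _ in range(horizon):
            a = policy(state)
            acc += discount * cost(state, a)
            state = transition(state, a)
            discount *= GAMMA
        total += acc
    return total / episodes
```

For this toy chain the exact value is v(0) = 1 / (1 - γ/2) ≈ 1.82, and the estimate converges to it as the number of episodes grows.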

Using the linearity of the expected value, one can redo the reasoning
from the deterministic case to arrive at a *stochastic Bellman
equation*:

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma \mathbb{E}\left[v^\star(s') \right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

For finite state spaces, this equation has the same contraction properties as the deterministic one. Hence, the tools of value and policy iteration are also readily available to solve MDPs. The optimization problems are harder, because of the expected value, but conceptually it is the same.
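As a sketch of how this looks in code (Python here, while the post’s own snippets are Julia; the two-state MDP and all names are made up), value iteration carries over by replacing the next-state value with an explicit weighted sum over transition probabilities:

```python
# Illustrative MDP: two states, two actions. P[s][a] lists pairs
# (probability, next_state); C[s][a] is the immediate cost.
GAMMA = 0.9
P = {
    0: {"safe": [(1.0, 0)], "risky": [(0.5, 0), (0.5, 1)]},
    1: {"safe": [(1.0, 1)], "risky": [(1.0, 1)]},
}
C = {0: {"safe": 2.0, "risky": 1.0}, 1: {"safe": 0.0, "risky": 0.0}}

def bellman(v):
    """One application of the stochastic Bellman operator:
    (Bv)(s) = min_a c(s, a) + gamma * E[v(s')]."""
    return {
        s: min(
            C[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])
            for a in P[s]
        )
        for s in P
    }

v = {0: 0.0, 1: 0.0}
for _ in range(200):  # contraction => geometric convergence
    v = bellman(v)
# v now approximates the optimal value function v*
```

For this made-up problem the iterates converge to v*(0) = 1/0.55 ≈ 1.82 and v*(1) = 0, with the "risky" action optimal in state 0.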

This last example is actually pretty simple, and using the full power we’ve developed on it is real overkill. Nevertheless, as it is a rather common first example to encounter when learning dynamic programming in computer science classes, it is worthwhile to view this kind of problem in the light of the formalism we’ve developed.

A recurrence relation is any function \mathbb{N} \to \mathbb{R} where the n-th term is recursively defined through the previous terms.

f(n) = \begin{cases} c_n, & n < k \\ g(n, f(n-1), \ldots, f(n-k)), & n \ge k. \end{cases}

The c_n are constants which correspond to the base cases of the recursion; for n \ge k, the value of f(n) depends directly on the previous k terms.

Famous examples with this structure are the factorial and Fibonacci functions:

\begin{align*} \mathrm{fat}(n) &= \begin{cases} 1, & n = 0 \\ n \cdot \mathrm{fat}(n-1), & n \ge 1. \end{cases} \\ \mathrm{fib}(n) &= \begin{cases} 0, & n = 0 \\ 1, & n = 1 \\ \mathrm{fib}(n-1) + \mathrm{fib}(n-2), & n \ge 2. \end{cases} \\ \end{align*}

In general, evaluating f(n) directly via the definition may be exponentially slow. But, in fact, it is always possible to calculate it in linear time using dynamic programming.
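The classic way to get that linear time is memoization: cache each value the first time it is computed. A minimal sketch in Python (the post’s code is in Julia):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Fibonacci with memoization: the naive recursion recomputes the
    same subproblems exponentially often, while caching makes each of
    f(0), ..., f(n) cost O(1) after its first computation."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(50)  # returns 12586269025 almost instantly
```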

To calculate f(n), we consider a state machine where the states are the numbers from 0 to n and where there is only a dummy action \star. Then, we recast the recursion as an operator

(\mathcal{B}v)(n) = \min_{a \in \{\star\}} \begin{cases} c_n, & n < k \\ g(n, v(n-1), \ldots, v(n-k)), & n \ge k. \end{cases}

Since there is only a single action, the minimization is redundant and the optimal value function satisfies the desired recursive equation. Thus, under very mild conditions our assumptions hold, and we can solve this equation by finding the fixed point of \mathcal{B}. As an example, the following video shows the first 15 Fibonacci numbers being calculated via value iteration.
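A sketch of that value iteration in Python (the post’s own code is Julia; all names are illustrative): start from an arbitrary guess and apply the operator until nothing changes.

```python
def bellman_fib(v):
    """One synchronous application of the recurrence operator B:
    base cases are constants, the other entries read values from the
    *previous* iterate."""
    return [0 if n == 0 else 1 if n == 1 else v[n - 1] + v[n - 2]
            for n in range(len(v))]

def fib_value_iteration(n):
    v = [0] * (n + 1)       # arbitrary initial guess
    while True:
        v_new = bellman_fib(v)
        if v_new == v:      # fixed point reached
            return v
        v = v_new
```

Each synchronous sweep finalizes at least one more entry, so roughly n passes are needed; that is quadratic total work for this problem.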

We can improve this method even more by using Backwards Induction! To evaluate f(n), we constructed a finite-horizon problem where there is only a single state per stage. Thus, considering that our initial state is n, we just have to fill the value function starting from 0 (which means going backwards, in this case), and we evaluate the recursion in n steps. Below is an example solving the Fibonacci equation while taking the order of the states into account.
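In code, the ordered traversal collapses everything into a single pass (again a Python sketch with illustrative names):

```python
def fib_backwards_induction(n):
    """Fill the value table in dependency order: each entry is
    finalized once, so a single pass of n steps suffices (contrast
    with the repeated sweeps of plain value iteration)."""
    v = [0] * (n + 1)
    if n >= 1:
        v[1] = 1
    for k in range(2, n + 1):
        v[k] = v[k - 1] + v[k - 2]
    return v
```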

Well, we finally reached the end of our overview of dynamic
programming. I sincerely hope it was as fun for you to read as it was
for me to write. And that DP gets the place of honor it deserves in your
problem-solving toolkit!^{10}

Of course, a single blog post is too tiny to encompass a subject as vast as DP. There are questions about estimating the value function instead of simply calculating it, infinite state spaces, continuous time and a plethora of cool stuff we can do. There are also a lot of connections with reinforcement learning, of which we only scratched the surface in this post. Unfortunately, these will remain as stories for another night.

Farewell and see you next time!

This post came to life after a series of conversations I had with Pedro Xavier. The good thing about explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also indebted to Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

In this appendix we show that the Bellman Operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s) \end{array}

satisfies all the requisites to be a contraction over the space of bounded continuous functions.

To prove that \mathcal{B} is a
contraction, we will start with another of its properties:
*monotonicity*. Besides this proof, it will also be useful in the
future when we talk about policy improvement.

Suppose that we have two value functions v and w such that v estimates costs uniformly lower than w:

v(s) \le w(s),\; \forall s \in \mathcal{S}.

Then the Bellman operator preserves this relationship: (\mathcal{B}v)(s) \le (\mathcal{B}w)(s), \; \forall s \in \mathcal{S}.

Given a state s, we get from v \le w that the following inequality holds for any action a \in \mathcal{A}(s):

c(s, a) + \gamma v(T(s,a)) \le c(s, a) + \gamma w(T(s,a)).

Since this is valid for any a, taking the minimum on both sides preserves the inequality. Thus

\min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma w(T(s, a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s).

Finally, let’s prove that the Bellman operator contracts the space of bounded continuous functions by the discount factor. What we need to show is that for any value functions v, w:

\|\mathcal{B}v - \mathcal{B}w\|_\infty \le \gamma \|v - w\|_\infty.

From the definition of the uniform norm, we get that for any state s,

v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty.

From the monotonicity we just proved, applying \mathcal{B} to both sides preserves this inequality:

(\mathcal{B}v)(s) \le \mathcal{B}(w + \|v - w\|_\infty)(s).

Let’s show that the constant term \|v - w\|_\infty only shifts the function \mathcal{B}w uniformly by another constant:

\begin{array}{rlll} \mathcal{B}(w + \|v - w\|_\infty)(s) &= &\min\limits_{a} & c(s, a) + \gamma (w(s') + \|v - w\|_\infty) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &= &\min\limits_{a} & c(s, a) + \gamma w(s') + \gamma \|v - w\|_\infty \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array}

This proves that

\begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned}

By doing the same derivation in the opposite direction (for w - v) we get an inequality for the absolute value. Applying the supremum, it becomes the result we want.

\begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned}
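As a sanity check, we can also verify this inequality numerically. The sketch below (Python, with a randomly generated toy problem; everything in it is illustrative) applies the Bellman operator to two random value functions and compares the sup-norm distances:

```python
import random

random.seed(1)
GAMMA = 0.9
N_STATES, ACTIONS = 5, [0, 1]

# Random deterministic transitions and costs, for illustration only.
T = {(s, a): random.randrange(N_STATES)
     for s in range(N_STATES) for a in ACTIONS}
c = {(s, a): random.random()
     for s in range(N_STATES) for a in ACTIONS}

def B(v):
    """The Bellman operator on a value function given as a list."""
    return [min(c[s, a] + GAMMA * v[T[s, a]] for a in ACTIONS)
            for s in range(N_STATES)]

def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))

# ||Bv - Bw|| <= gamma * ||v - w|| should hold for any pair v, w.
v = [random.uniform(-10, 10) for _ in range(N_STATES)]
w = [random.uniform(-10, 10) for _ in range(N_STATES)]
ok = sup_norm(B(v), B(w)) <= GAMMA * sup_norm(v, w) + 1e-12
```

The check passes for any pair v, w and any seed, since it is exactly the contraction property proved above.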

Since \mathcal{B} is a contraction, the Banach fixed point theorem guarantees that there exists a unique value function v^\star satisfying the Bellman equation. Furthermore, the theorem also points us towards a way to solve this kind of equation with polynomial complexity in the size of the state and action spaces. This is the topic we’re going to investigate next.

To be more precise, we work with hydrothermal dispatch problems, where one must decide between many sources of energy (hydro, thermal, renewable) to supply a certain power demand taking into account the uncertainties of the future. For example: hydro is cheap and clean, but you risk running out of water if you use all of it and next month turns out particularly dry. Finding the best energy dispatch is once again solved via dynamic programming.↩︎

Even Richard Bellman admittedly named it based on how cool it sounds.↩︎

*Dynamic Programming*, Princeton Landmarks in Mathematics and Physics (Princeton, NJ: Princeton University Press, 1957), ch. 3, p. 83.↩︎

It is out of scope more because it is a tangent to the topic than because of any difficulty in the proof. If you are interested in analysis, I really recommend you try proving it. The proof consists of using the contraction property to show that the distance between the iterates of f must converge towards zero.↩︎

In contrast with value iteration, it also has no dependence on the costs being real numbers. In this case any ordered ring (such as \mathbb{N} or \mathbb{Z}) would do. But I digress… This is out of scope for this post.↩︎

*Abstract Dynamic Programming* (Belmont, Mass: Athena Scientific, 2013).↩︎

It is also possible to define MDPs where the cost is stochastic. The formulations are equivalent, and one may choose which one best models the problem at hand.↩︎

Loosely based on the diagram found in Richard S. Sutton and Andrew G. Barto,

*Reinforcement Learning: An Introduction*, Second edition, Adaptive Computation and Machine Learning Series (Cambridge, Massachusetts: The MIT Press, 2018).↩︎

“Playing Atari with Deep Reinforcement Learning,” December 2013, https://doi.org/10.48550/ARXIV.1312.5602.↩︎

In fact, fixed points and recursion as a whole deserve the spot. They’re everywhere!↩︎