# A Tale of Dynamic Programming

25 June 2022

What if I told you that some of the most used algorithms to find the shortest path in a graph, calculate gradients while training a neural network, and parse context-free grammars are essentially implementations of the same principle? It is called dynamic programming and is one of those instances in mathematics where a simple principle unfolds into profound conclusions ranging over many fields. In fact, we can, already in this first paragraph, summarize the idea using Richard Bellman’s (Dynamic Programming’s creator) own words:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I gotta say that I was taught dynamic programming in different contexts, but it took me a while to finally get the “click” that they were actually the same thing. When learning algorithms and data structures, it was a memoization-based technique where you speed up your algorithm by first solving the easier parts and saving them for later use. At work, I mostly deal with solving a lot of linear programs for long-term scheduling problems.1 The main algorithm we use, called Stochastic Dual Dynamic Programming, at first didn’t seem so much like the programming technique algorithms class. Finally, one of the main methods for model-based reinforcement learning is again called dynamic programming, and it also didn’t seem so much like the other instances.

So, what’s happening here? Did everybody choose to call their algorithms dynamic programming just because it’s a cool name?2 Well, in fact there are some principles that apply to all of those instances, from planning a rocket’s trajectory to TeX’s word-wrapping. And the list goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

## On Decision-Making and State Machines

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let’s start with something I am really fond of: old school platformer games. In our hypothetical game which is definitely not about some Italian plumber, the character does stand idle doing nothing by default. But with the press of a button in the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activate the respective animation on the screen. In the best Resident Evil style, this game only allows a character to shoot while idle and also forces you to first be idle after a jump before doing any other action. Think of that as the time it takes to restore one’s balance after falling. This description may seem overly complicated on text, but fortunately the nice folks in the Comp Sci department already invented diagrams that show these transitions nicely.

Our modeling above is an instance of something called a state machine or automata if you’re into Greek words. There are 4 states in which the character might be and at each one there is an available set of actions to take that transitions that state. More abstractly, an automaton is a system that can be in one of many states s \in \mathcal{S} and at each state, you can choose among a set of actions a \in \mathcal{A}(s) that transition the system to a new state s' = T(s, a), where T : (s : \mathcal{S}) \times \mathcal{A}(s) \to \mathcal{S} is called the system’s transition function.

An important notice: If you come from Computer Science, you are probably used to finite state machines. Just know that in our context, the states may be any set. Some of the algorithms that we will see today only work for finite state spaces, but there are others that may even require a continuous space! An example is Stochastic Dual Dynamic Programming, which is based upon linear programming duality and thus requires the state space to be a convex subset of \mathbb{R}^n.

### The Dynamics of Decision-Making

Iterating the transition T establishes a dynamics for our system where we start at an initial state s and by taking a sequence of actions a_1, a_2, \ldots we walk over the state space.

\begin{aligned} s_1 &= s, \\ s_{t+1} &= T(s_t, a_t). \end{aligned}

This new view makes our state machine somewhat equivalent to a controllable dynamical system, which is another really cool name in my opinion.

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your system in order to choose an action, no matter the previous history nor time step. Indeed, if those things affect your choice, then you can without loss of generality model the problem as a larger system where the state also carries this information. Thus controlling a dynamic system amounts to a function \pi : (s : \mathcal{S}) \to \mathcal{A}(s) which chooses a valid action for each state. In the literature this called a policy, in analogy to a government taking actions to run its state.

Unfortunately life is not known for its free lunches and in most systems, whenever we take action a at state s, there is a certain cost c(s, a) to pay. Depending on the context this can be, for example, a real monetary cost (for economic problems), some total distance (for planning) or even a negative cost representing a reward.

Thus, by following a policy \pi, we produce a sequence of cost c(s_t, \pi(s_t)) for each time step. We could define the total cost for \pi as the sum of those costs, but there is an additional detail to notice. If I gave you something and asked whether you want to pay me today or next year, which option would you prefer? Sometimes there are factors such as inflation or interests that make costs in the future not have the same actual value as the costs we expend right now. This prompts us to introduce a problem dependent discount factor \gamma \in [0, 1] such that the total cost for a policy \pi is

\begin{array}{rl} v^\pi(s) = & c(s_1, \pi(s_1)) + \gamma c(s_2, \pi(s_2)) + \gamma^2 c(s_3, \pi(s_3)) + \ldots \\ \textrm{where} & s_1 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array}

The equation above defines the value function v^\pi : \mathcal{S}\to \mathbb{R} for a given policy \pi. Spoiler: keep an eye on the v^\pi, because later in this post we will find them to be useful tools closely related to the memoization techniques that people usually identify with dynamic programming.

Besides its practical interpretation, the discount factor \gamma also plays a significant role from the mathematical point of view. If |\gamma| < 1 and the costs are uniformly bounded (which is the case for a finite action space, for example) we can guarantee that the series defining v^\pi converges for any policy and initial state. That is, suppose that exists M > 0 such that

\forall s \in \mathcal{S}, a \in \mathcal{A}(s),\, |c(s, a)| \le M.

This bounds the total cost by a geometric series that cannot blow up,

\sum\limits_{t=1}^\infty \gamma^{t-1}|c(s_t, \pi(s_t))| \le \sum\limits_{t=1}^\infty \gamma^{t-1} M \le \frac{M}{1 - \gamma},

thus guaranteeing that the value function is well-defined.

### Optimal Decisions

Having multiple courses of action possible prompts us to ask which is the best one possible. When programming a robot to escape a labyrinth, you want it to take the least amount of time. When controlling a spaceship towards the moon, it is important to guarantee that it will use the least amount of fuel. When brawling at a bar, you want to knock out your foe with the least injuries possible. Must of all, the best policy is the one with the least cost taking all time into account; both the present and its future consequences. For example, sometimes a policy that has a higher cost for the first state is overall better because it puts us into a more favorable state. Thus, our problem can be naturally formulated as searching for optimal policies:

Starting at state s, find a policy \pi producing the least total cost over time.

Or equivalently in math language:

\begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure that we can exploit in order to solve it. This will be the subject of the next section. Before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision making framework.

#### Example: Shortest Path in a Graph

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process description is quite straightforward.

• States: nodes in the graph.
• Actions at state s: edges going from s to another node.
• Transition: The opposite node on the same edge. That is, given an edge s \to s', T(s, s \to s') = s'.
• Costs: c(s, a) is the weight of edge a.

Finding the shortest path from s to node z is the same as setting the initial state to s and making z a terminal state of our dynamics.

## Dynamic Programming

Alright, it’s finally time to solve those decision problems. The simplest idea could be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive since the possible candidates grow exponentially with the time steps. Any practical method will take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous Bellman principle of optimality, which is the cornerstone of dynamic programming. Taking Richard E Bellman3’s own words, it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is telling us is that in order to calculate an optimal policy, we should turn this iterative process of making actions and calculating costs into a recursive procedure. That is, taking an action puts us into a new state s_2 = T(s_1, a_1) where we are again faced with the exact same problem of finding an optimal policy but this time starting at s_2. Let’s see how we can exploit this idea.

Remember that we defined the value function v^\pi as the total cost of following a policy \pi when starting at a given state. Let’s define the optimal value function v^\star as the total cost of choosing the best course of action while starting at a certain state s.

\begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Notice in the optimization problem above that the initial state is only ever used to choose the first action. Later actions do not depend directly on it but only on its consequences. This means that we can break the problem into two parts: calculating an immediate cost dependent on the initial state and calculating a future cost dependent on all next states.

\begin{array}{rl} v^\star(s) = \min\limits_{a,a_t} & c(s, a) + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_2 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

There’s already some recursive structure unfolding in here! What is still missing consists of noticing that since the sum in the future cost starts at t =2, we can factor out \gamma. By renaming l = t-1 we get

\sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{t=2}^\infty \gamma^{t-2}c(s_t, a_t) = \gamma \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l),

and applying this in the expression for v^\star,

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l) \\ \textrm{s.t.} & s_1 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Although this is a huge expression, it should be straightforward to see that the expression for the future cost is exactly the optimal value v^\star(s') of starting the dynamics at s' = T(s, a). This way, the principle of optimality express itself mathematically as a recursive equation that the value for an optimal policy must satisfy.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

This is called the Bellman equation and all dynamic programming consists of methods for solving it. Even more: we can think of the Bellman equation as a recursive specification for the decision problems and of dynamic programming as any problem-specific implementation that solves it.

### Existence, Uniqueness and Fixed Points

It is time to get deeper into analysis. Whenever mathematicians see a recursive relation such as the Bellman equation, they immediately start asking such things: what guarantees do we have about v^\star? Can I trust that it is unique? Does it even exist? Surely we mathematicians may seem a bit too anxious with all these questions, but they are for good reasons. Besides the guarantee that everything works, proving the existence of a solution in many cases also teaches us how to construct this solution. In fact, this will be just the case! In the next section we are going to adapt the theorems in here into algorithms to solve the Bellman equation.

Recursion has a deep relationship with fixed points. For example, we can adapt the Bellman equation as the fixed point of an operator \mathcal{B}: (\mathcal{S}\to \mathbb{R}) \to (\mathcal{S}\to \mathbb{R}) called, you can guess, the Bellman Operator. It takes value functions to value functions and is defined as

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

If you are not accustomed with fixed points, the transition above from the Bellman equation to operator may seem strange. Hence, let’s think about a little story in order to develop some intuition.

Imagine that you are the king/queen of fantastical kingdom. You are an absolute monarch and whatever action you decide to take, your subjects will follow. Lately, the kingdom’s reserves are running dry and your counselors advised you to define a clear governing policy in order to minimize the kingdom’s spendings. Since besides a ruthless ruler you’re also a great mathematician and a fan of my blog, at this point you already know what must be done to save your kingdom from bankruptcy: solve the Bellman equation in order to find its optimal policy.

Because at this point of the post you still don’t know how to solve it, you decide to take advantage of your fantastical world and hire a wizard to look into his crystal ball and act as an oracle telling you how much each possible state will cost to the kingdom in the future. Beware though, that it is not wise to blindly follow advices from a crystal ball. The right thing to do in this situation is to use this oracle to decide on what to do at each state. Now, instead of solving the Bellman equation, all you have to do to get a policy is to solve the optimization problem with future cost given by the wizard’s predictions. The processes of going from prediction to decision is precisely the Bellman operator. The function \mathcal{B}v is the cost of choosing the best action according to the prediction.

The optimal value function, furthermore, is the one with the correct prediction, and is, thus, kept unchanged by the Bellman operator:

\mathcal{B}v^\star = v^\star.

We thus reduce the question of existence and uniqueness of solutions for the Bellman equation to finding fixed points of \mathcal{B}. Fortunately there is a theorem that does just that, called the Banach Fixed Point theorem!

Let (M,\mathrm{d}) be a complete metric space and f : M \to M a continuous function such there is a \gamma \in (0,1) satisfying

\mathrm{d}(f(x), f(y)) \le \gamma \mathrm{d}(x, y)

for any x, y \in M. Then f has a unique fixed point x^\star and for any initial value x_0, iterating f will eventually converge towards x^\star,

\lim_{n \to \infty} f^n(x_0) = x^\star.

Proving this theorem is out of scope for this post4. However, we can think of it as saying that if a mapping shrinks all distances, then eventually the image of all points will be squeezed into a single point.

To apply the Banach fixed point theorem on the Bellman operator, we need to find a metric space where \mathcal{B} is a contraction. The suitable space are the bounded continuous functions over the states, C^0_b(\mathcal{S}, \mathbb{R}), imbued with the uniform norm

\|f\|_\infty = \sup_{s \in \mathcal{S}} |f(s)|.

This is a complete metric space and is general enough for any practical problem we may think of. Furthermore, recall that if the state space is discrete, then any function is continuous and bounded, hint that this encompasses the most important spaces from a computational perspective.

Since I don’t want to depart too much from this post’s main topic nor dive into mathematical minutiae,, this section only enunciates the important theorem. In case you are interested, the necessary proofs are in an appendix for completeness.

The Bellman operator is a continuous contraction whenever the discount factor is \gamma < 1. Thus, it has a unique fixed point v^\star and no matter the initial choice for the value function v_0,

\mathcal{B}^{(n)} v_0 \xrightarrow{n \to \infty} v^\star.

Moreover, we can estimate the distance between the nth iterate v_n and the optimum through:

\mathrm{d}(v_n, v^*) \le \frac{\gamma^n}{1 - \gamma} \mathrm{d}(v_0, \mathcal{B}v_0).

Besides the existence and uniqueness guarantees, this theorem also spells something with a more practical character. No matter with which cost estimative we start, we can use the Bellman operator to update our value function until it converges towards the optimal. This is the next section’s topic.

### Solving the Bellman Equation

For this section, let’s assume that both the state \mathcal{S} and action \mathcal{A}(s) spaces are finite. This will allow us to focus on exhaustive methods exploring the entire state space. Keep calm however, I am planning to write other posts in the future to explain how these ideas generalize to continuous spaces, or even to finite spaces that are too huge to explore entirely, via something called Approximate Dynamic Programming or Reinforcement Learning. But this is a story for another night…

#### Value Iteration

From the previous discussion, we learned that iterating the Bellman operator convergences towards the optimal the optimal value function. Thus, we arrive at our first algorithm: value iteration. Its main idea is actually quite simple: to convert the Bellman equation into an update rule.

v \gets \mathcal{B}v.

We can thus start with an initial value function v_0 and iterate the update rule above. By the magic of the Banach Fixed Point theorem, this will converge towards the optimal value function no matter what the initial value function is. This procedure repeats until the uniform error \| v - \mathcal{B}v \|_\infty becomes less than a previously set tolerance (for which we have an estimate of necessary iterations).

By efficiency reasons, it is customary to represent the value function not as a function but using some data structure (usually a vector). The memoization that people associate with dynamic programming lies entirely in this “trick”. However, it is good to keep in mind that this is only a matter of computational representation of functions and is totally orthogonal to any algorithm’s design. In a language with first-class functions, it is possible to do dynamic programming using only function composition. It is just possible that it will not be as fast as one would like.

In value iteration, we obtain an optimal policy from the value function by keeping track of the \argmin whenever we solve an optimization problem. Below we see a Julia implementation of value iteration.

# The dynamics total cost of choosing action a at stage s,
# given a future cost estimate v.
function total_cost(v, s, a)
cost(s, a) + γ*v[next(s, a)]
end

function value_iteration(v0 ; tol)
v  = copy(v0)
π  = Policy{States, Actions}()
maxerr = Inf
while maxerr > tol
maxerr = 0
for s in States
prev = v[s]
v[s], π[s] = findmin(a -> total_cost(v, s, a), Actions(s))
maxerr = max(maxerr, abs(v[s] - prev))
end
end
return π, v
end

In the animation below, we can see value iteration in action for the problem of escaping from a maze. In this model, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). The objective is to reach the right-bottom edge in the minimum amount of steps possible. We do so by starting with a uniformly zero value and a random policy. In the left we see the value function at each iteration and in the right the associated policy.

The algorithm above is in fact just one variation of value iteration. There are still many problem-dependent improvements one can make. For example, we chose to update v in-place, already propagating the new value function while traversing the states, but we could instead have kept the old value function and only updated v after traversing all states. Our approach has the advantage of using the improved information as soon as it is available but updating in batch may be interesting when we’re able to broadcast the optimization across many processes in parallel.

Other important choice we have is the initial value function. Choosing a good warm start can greatly improve the convergence. As an example, whenever there is a terminal state \blacksquare, it is a good idea to already fill v(\blacksquare) = 0. Finally, the order that we traverse the states matter. There is a reason why dynamic programming is famous for solving problems backwards. If we know that a given state is easier to solve, we should start the traverse by it. A specialization that we will further explore in a bit.

Well, we have a lot of options… Nevertheless, as long as we keep visiting all states, any of those approaches is guaranteed to converge towards the optimal value function, which one is faster being generally problem-dependent. This is why I think it is best to think of DP not as an algorithm but as a principle that encompasses many similar algorithms.

##### Backward Induction over a Finite Horizon

Let’s take look at a typical scenario where we can exploit the state space structure to make value iteration much faster: problems with a fixed finite horizon.

When the dynamics ends at a certain number N of time steps, we say that the problem has a finite horizon. In this case, the stage t is part of the state variable, (Because how else would we know when to stop?) and there is a terminal state \blacksquare with zero cost and representing anything that happens after the dynamics end. Essentially, we work for N stages and then reach \blacksquare, where we can just relax and do nothing for the rest of eternity.

If you prefer equations over figures, a finite horizon state machine is one where the transition function looks like this:

\bar{T}((t, s),\, a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases}

But what is so special about finite horizons? After all, the equation above seems much more confuse than what we had before. Well… what we gain is that the state space \mathcal{S} is clustered into smaller spaces that are visited in sequence: \mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N, \mathcal{S}_{N+1} = \{\blacksquare\}.

The backward induction algorithm consists of value iteration but exploiting the structure we just discussed. At the terminal state, the cost is always zero, so we can set v(\blacksquare) = 0. But this, means that the future cost for all states in \mathcal{S}_N is fixed, and the optimal policy for them is just to choose the cheapest action. Now that we know which action to choose in \mathcal{S}_N, the future is fixed for all states in \mathcal{S}_{N-1}, reducing the problem in them to the routine we just did. We repeat this procedure until we reach the first stage. This calculates an optimal policy by induction, but moving “backwards” in time.

In Julia code, the algorithm looks like this:

function backward_induction()
v  = Values{States}()
π  = Policy{States, Actions}()
for t in N:1
for s in States(t)
v[s], π[s] = findmin(a -> total_cost(v, s, a), Actions(s))
end
end
return π, v
end

I recommend you to compare this with the generic value iteration in order to see what we gain. One thing should be clear: backward induction does the exact same operation as value iteration for each state, but only requires a single pass over the state space. This makes it much more efficient. Another more subtle detail is that since we have more structure to exploit in the dynamics, the algorithm doesn’t have to assume so much about the Bellman operator. For example, although hand-wavy, we just gave a proof of convergence above that has no dependence on the Banach fixed-point theorem. Thus, as a bonus, backward induction works for any discount factor \gamma.5

#### Policy Iteration

One issue with value iteration is that all policy calculations are implicit, since we just work with value functions. It is then possible that we reach an optimal policy and keep iterating the algorithm to improve the estimate for the value function. In this section, let’s see how we can directly calculate an optimal policy in a finite number of steps.

##### Policy Evaluation

Our next question is then how to calculate the cost associated with a policy. Let’s say somebody gave you a policy \pi and told you nothing more about it. How can you find its value function v^\pi? One way is to notice that it satisfies a recursion similar to the Bellman equation, but without the minimization step.

v^\pi(s) = c(s, \pi(s)) + \gamma v^\pi(T(s, \pi(s)).

If you follow the same steps we did before transforming this equation into a fixed point problem, you will see that under the same assumptions of continuity (always valid for finite state and action spaces) it also has a unique solution. Moreover, turning it into an update procedure converges towards v^\pi for any initial value function. This way, we arrive at an algorithm for evaluating the cost of a policy, unimaginatively called policy evaluation.

function policy_evaluation(π, v0=zeros(States); tol)
v  = copy(v0)
maxerr = Inf
while maxerr > tol
maxerr = 0
for s in States
prev = v[s]
v[s] = total_cost(v, s, π(s))  # same total cost function as defined for value iteration
maxerr = max(maxerr, abs(v[s] - prev))
end
end
return v
end

Notice the similarity with value iteration. The only difference is on the update rule: instead of choosing an optimal action, we just follow along with the policy.

##### Policy Improvement

After we know a policy and its value function, our next question is how to improve it. That is, how can we use this information to get nearer to an optimal policy.

The secret lies in locally improving our policy for each state. Consider a state s. The value v^\pi(s) is the total cost of starting at s and following \pi thereafter. What if there is an a \in \mathcal{A}(s) such that choosing a at the first step and following \pi thereafter is cheaper than just following \pi?

c(s, a) + v^\pi(T(s, a)) < v^\pi(s).

Since we have this improvement at the first step and our processes are assumed to not depend on anything besides what is represented on the state s, then it must be better to choose a than \pi(s) whenever we are at state s. That is, if we define a new policy

\lambda(x) = \begin{cases} a, & x = s \\ \pi(x), & \text{otherwise} \end{cases}

it is always better to follow \lambda then \pi, because v^\lambda \le v^\pi.

We thus arrive at our next algorithm: policy iteration. It consists of taking a policy \pi, finding its value function v^\pi through policy evaluation and finally using policy improvement to arrive at a better policy. Since there are only finitely many policies, and we always obtain a strictly better policy, this algorithm is guaranteed to converge to an optimal policy in a finite amount of steps.

function policy_improvement(π0, v)
π = copy(π0)
for s in States
π[s] = argmin(a -> total_cost(v, s, a), Actions(s))
end
return π
end

function policy_iteration(π, v=zeros(States); tol)
while true
π0 = π
v  = policy_evaluation(π, v; tol=tol)
π  = policy_improvement(π, v)
if π == π0 break end
end
return π, v
end

Did you notice that we iteratively alternate between two kinds of passes over the states? This is the mother of all backwards-forwards algorithms people generally associate with dynamic programming.

Just like with value iteration, there’s also a lot of freedom in how we traverse the states. Again it is useful to think of policy iteration more as a principle than as an algorithm in itself and adapt the steps to consider any problem specific information that may be available.

## What Can Dynamic Programming solve?

Ok, it’s time to recapitulate a bit. We started our journey looking at automata and controllable dynamics, showing how we can represent the best set of actions as those who fulfill a certain recursive relation on their value functions, called the Bellman equation. Then, we explored a couple ways of solving this equation through methods for finding fixed points. Now, a question for the reader: how much do you think those methods actually depended on the Bellman equation itself?

If you go back to our algorithms for value iteration, backwards induction, or policy iteration, you will notice that the Bellman operator itself only appears in a single line. Furthermore, this apparition is completely encapsulated. The algorithms do not depend on the Bellman operator per se; they work because it satisfies a couple reasonable properties!

Very well, the mathematician inside me sees something like that and starts thinking about generalizing. Let’s see where a bit more abstraction leads us.

Value iteration is simply Banach’s fixed point in disguise. Hence, it is a tool capable of solving any problem described by a recursive equation where the operator is a monotone contraction. Similarly, we can use policy iteration whenever the operator associated with following a policy shows a monotone contraction behavior. To be fair, one could go even further and substitute the hypothesis from Banach’s theorem for any other theorem guaranteeing an attractor among the value functions. As this can get rather technical, We will not explore this in here, but if you are curious (and a big fan of abstract books), a great place to start is Dimitri P. Bertsekas6’s book.

Alright, this post is becoming too abstract for my taste. Let’s dive into a couple examples to see why this generalization is cool.

### Stochastic Dynamic Programming

Until now, we’ve only dealt with deterministic processes. Life, on the other hand, is full of uncertainty and, being a nice applied field, dynamic programming was created from its inception keeping in mind stochasticity.

We call a state machine where the transitions T(s, a) are stochastic a Markov Decision Process (MDP for short). This name comes from the fact that the new state only depends on the current state and action, being independent of the process’ history; Just like a Markov chain. A usual intuition for this kind of processes is as the interaction between an actor and an environment. At each time step, the environment is at a state s (which is known to the actor), and the actor may choose among different actions a \in \mathcal{A}(s), each one incurring a certain cost7, to interact with it. This action affects the environment in some way that is out of reach to the actor (thus stochastic / non-deterministic), changing its state to s' = T(s, a). This is illustrated in the diagram below.8

Allowing non-determinism opens the way for modeling a lot more cool situations. For example, robots that play video games! The states may be the internal state of the game or some partial observation of them that is available to the player and the actions are the buttons on the controller. The transitions are internal to the game and the costs are related to some winning/losing factor. Have you ever heard of Volodymyr Mnih et al.9’s Playing Atari with Deep Reinforcement Learning paper? In it, they use reinforcement learning to train a robot capable of playing Atari 2600 games and all modeling is done via Markov decision processes in a way that is really similar to the discussion in this paragraph. I really recommend checking it out.

With a stochastic environment, we can’t necessarily predict the total costs and transitions, only estimate them. To accommodate that, we will have to change a few things in our definition of policies and value functions.

Even if fix a policy \pi, we cannot be certain of what is going to happen, because the costs and transitions are stochastic. We are, however, able to calculate the average cost by taking the expected value among all possible outcomes, conditioned to following the policy \pi.

v^\pi(s) = \mathbb{E}\left[ \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \middle| s_1 = s,\, s_{t+1} = T(s_t, a_t),\, a_t = \pi(s_t) \right]

Using that the expected value is linear, one can redo the reasoning from the deterministic case to arrive at a stochastic Bellman equation:

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma \mathbb{E}\left[v^\star(s') \right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

For finite state spaces, this equation has the same contraction properties as the deterministic one. Hence, the tools of value and policy iteration are also readily available to solve MDPs. The optimization problems are harder, because of the expected value, but conceptually it is the same.

### Solving Recurrence Relations

This last example is actually pretty simple and using the full power we’ve developed is a real overkill. Nevertheless, as it is a rather common first example to encounter when learning Dynamic Programming in Computer Science classes, it is worth it to view this kind of problem in the light of the formalism we’ve developed.

A recurrence relation is any function \N \to \mathbb{R}, where the nth term is recursively defined through the previous terms.

f(n) = \begin{cases} c_n, & n < k \\ g(n, f(n-1), \ldots, f(n-k)), & n \ge k. \end{cases}

The c_n are constants which correspond to the base case of the recursion the value of f(n) depends directly on the previous k terms.

Famous examples with this structure are the factorial and Fibonacci functions:

\begin{align*} \mathrm{fat}(n) &= \begin{cases} 1, & n = 0 \\ n \cdot \mathrm{fat}(n-1), & n \ge 1. \end{cases} \\ \mathrm{fib}(n) &= \begin{cases} 0, & n = 0 \\ 1, & n = 1 \\ \mathrm{fib}(n-1) + \mathrm{fib}(n-2), & n \ge 2. \end{cases} \\ \end{align*}

In general, evaluating f(n) directly via the definition may be exponentially slow. But, in fact, it is always possible to calculate it in linear time using dynamic programming.

To calculate f(n), we consider a state machine where the states are the numbers from 0 to n and where there is only a dummy action \star. Then, we recast the recursion as an operator

(\mathcal{B}v)(n) = \min_{a \in \{\star\}} \begin{cases} c_n, & n < k \\ g(n, v(n-1), \ldots, v(n-k)), & n \ge k. \end{cases}

Since, there is only a single action, the minimization is redundant and the optimal value function satisfies the desired recursive equation. Thus, under very mild conditions our assumptions hold, and we can solve this equation by finding the fixed point of \mathcal{B}. As an example, the following video shows the first 15 Fibonacci numbers being calculated via value iteration.

We can improve this method even more by using Backwards Induction! To evaluate f(n), we constructed a finite horizon problem where there is only a single state per stage. Thus, considering that our initial state is n, we just have to fill the value function starting from 0 (which means going backwards, in this case) and we evaluate the recursion in n steps. Below there is an example solving the Fibonacci equation, but taking into account the order of states.

## End of our Journey

Well, we finally reached the end of our overview of dynamic programming. I sincerely hope it was as fun for you to read as it was for me to write. And that DP gets the honor place it deserves in your problem solving toolkit!10

Of course, a single blog post is too tinny to encompass a subject as vast as DP. There are matters about estimating the value function instead of simply calculating it, infinite state spaces, continuous time and a plethora of cool stuff we can do. There are also a lot of connections with reinforcement learning, for which we only scraped the surface in this post. Unfortunately, these will remain as stories for another night.

Farewell and see you next time!

## Acknowledgements

This post come to life after a series of conversations I had with Pedro Xavier. The good thing of explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also in debt with Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

## Appendix (Proofs of Convergence)

In this appendix we show that the Bellman Operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s) \end{array}

satisfies all the requisites to be a contraction over the space of bounded continuous functions.

To prove that \mathcal{B} is a contraction, we will start with another of its properties: monotonicity. Besides this proof, it will also be useful in the future when we talk about policy improvement.

Suppose that we have two value function v and w such that v estimates costs uniformly lower than w:

v(s) \le w(s),\; \forall s \in \mathcal{S}.

Then the Bellman operator preserves this relationship: (\mathcal{B}v)(s) \le (\mathcal{B}w)(s), \; \forall s \in \mathcal{S}.

Given a state s, we get from v \le w that the following inequality holds for any action a \in \mathcal{A}(s):

c(s, a) + \gamma v(T(s,a)) \le c(s, a) + \gamma w(T(s,a)).

Since this is valid for any a, taking the minimum on both sides preserves the inequality. Thus

\min_{a \in \mathcal{A}(s)} c(s, a) + v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + w(T(s, a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s).

Finally, let’s prove that the Bellman operator contracts the space of bounded continuous functions by the discount factor. What we need to show is that for any value function v, w:

\|\mathcal{B}v - \mathcal{B}w\|_\infty \le \gamma \|v - w\|_\infty.

From the definition of the uniform norm, we get that for any state s,

v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty.

From the monotonicity we just proved, applying \mathcal{B} to both sides preserves this inequality:

(\mathcal{B}v)(s) \le \mathcal{B}(w + \|v - w\|_\infty)(s).

Let’s show that the constant factor \|v - w\|_\infty only shifts the new function \mathcal{B}w by uniformly by another constant. Calling it k to declutter the notation,

\begin{array}{rlll} (\mathcal{B}(w) + \|v - w\|_\infty)(s) &= &\min\limits_{a} & c(s, a) + \gamma (w(s') + \|v - w\|_\infty) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &= &\min\limits_{a} & c(s, a) + \gamma (w(s')) + \gamma \|v - w\|_\infty \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array}

This proves that

\begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned}

By doing the same derivation in the opposite direction (for w - v) we get an inequality for the absolute value. Applying the supremum, it becomes the result we want.

\begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned}

Since \mathcal{B} is a contraction, the Banach fixed point theorem guarantees to us that there exists a unique value function v^\star satisfying the Bellman equation. Furthermore, the theorem also points us towards a way to solve this kind of procedure with polynomial complexity on the size of the state and action spaces. This is the topic we’re going to investigate next.