A Tale of Dynamic Programming

25 June 2022

What if I told you that some of the most used algorithms to find the shortest path in a graph, calculate gradients while training a neural network, and parse context-free grammars are essentially implementations of the same principle? It is called dynamic programming and is one of those instances in mathematics where a simple principle unfolds into profound conclusions ranging over many fields. In fact, we can, already in this first paragraph, summarize the idea using Richard Bellman’s (Dynamic Programming’s creator) own words:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I have to admit that despite encountering dynamic programming in different contexts, it took me a while to finally get the “click” that they were actually the same thing. When learning algorithms and data structures, it was a memoization-based technique where you could speed up some algorithms by first solving the easier parts and storing the solution for later use. Then, at work, I mostly deal with solving a lot of linear programs for long-term scheduling problems.1 The main algorithm we use, called Stochastic Dual Dynamic Programming, at first didn’t seem so much like the programming technique from the algorithms class. Finally, one of the main methods for model-based reinforcement learning is again called dynamic programming, and it also didn’t seem so much like the other instances.

So, what’s happening here? Did everybody choose to call their algorithms dynamic programming just because it’s a cool name?2 Well, in fact there are some principles that apply to all of those instances, from planning a rocket’s trajectory to TeX’s word-wrapping. And the list goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

On Decision-Making and State Machines

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let’s start with something I am really fond of: old school platformer games. In our hypothetical game which is definitely not about some Italian plumber, the character stands idle doing nothing by default. But with the press of a button on the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activates the respective animation on the screen. In the best Resident Evil style, this game only allows a character to shoot while idle and also forces you to first be idle after a jump before doing any other action. Think of that as the time it takes to restore one’s balance after falling. This description may seem overly complicated in text, but fortunately the nice folks in the Comp Sci department already invented diagrams that show these transitions nicely.

Our modeling above is an instance of something called a state machine, or automaton if you’re into Greek words. There are 4 states in which the character might be, and at each one there is a set of available actions that transition the state. More abstractly, an automaton is a system that can be in one of many states s \in \mathcal{S} and at each state, you can choose among a set of actions a \in \mathcal{A}(s). Whenever you take an action, the system changes to a new state according to a transition function

T : (s : \mathcal{S}) \times \mathcal{A}(s) \to \mathcal{S}.

Unfortunately life is not known for its free lunches and, in general, whenever one takes action a at state s, it is necessary to pay a certain cost, properly modeled as another function

c : (s : \mathcal{S}) \times \mathcal{A}(s) \to \mathbb{R}.

Depending on the context this can be, for example, a real monetary cost (in economic contexts), some total distance or elapsed time (for planning) or even a negative cost representing a reward.
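To make this concrete, here is a minimal sketch of the platformer automaton in plain Julia; the exact state names, action names and costs below are made up for illustration.

# States of our not-Mario character and the actions available at each one.
states  = [:idle, :walking, :jumping, :shooting]

actions = Dict(
  :idle     => [:walk, :jump, :shoot, :stay],
  :walking  => [:stop, :jump],
  :jumping  => [:land],     # must land (become idle) before anything else
  :shooting => [:stop],
)

# Transition function T(s, a): which state each action leads to.
transition = Dict(
  (:idle, :walk)     => :walking,
  (:idle, :jump)     => :jumping,
  (:idle, :shoot)    => :shooting,
  (:idle, :stay)     => :idle,
  (:walking, :stop)  => :idle,
  (:walking, :jump)  => :jumping,
  (:jumping, :land)  => :idle,
  (:shooting, :stop) => :idle,
)

# Cost function c(s, a): a made-up "stamina" cost per action.
cost(s, a) = a == :stay ? 0.0 : 1.0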

The Dynamics of Decision-Making

Iterating the transition T establishes a dynamics for our system: by starting at an initial state s_0 and taking a sequence of actions \{a_t\}, we generate a trajectory over the state space.

s_{t+1} = T(s_t, a_t).

When viewed in this light, our state machines are called controllable dynamical systems or decision processes, which are yet additional cool names for you to memorize.

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your system in order to choose an action, regardless of the previous history or the time step. Indeed, if any other thing affects your choice, you can, without loss of generality, model the process as a larger automaton where the state also carries the additional information. Thus, controlling a dynamical system amounts to selecting a valid action for each state, that is, a function

\pi : (s : \mathcal{S}) \to \mathcal{A}(s).

In the literature this is called a policy, in analogy to a government taking actions to control the state of the nation.

Starting at state s_0 and following a policy \pi produces a deterministic dynamical system without the need for choosing a control:

s_{t+1} = T(s_t, \pi(s_t)).

This dynamics, in turn, yields a cost c(s_t, \pi(s_t)) for each time step. We could define the total cost for \pi as the sum of those costs, but there is an additional detail to notice. Suppose that, for any reason, money is short and you had to take a loan in order to pay your bills. In those struggling conditions, would you prefer to pay the money back today or next year?

Sometimes there are factors such as inflation or interest rates that make a future cost’s real value differ from its nominal value. This prompts us to introduce a problem dependent discount factor \gamma \in [0, 1] representing how much the cost depreciates over time. The total cost of following a certain policy \pi is the cumulative sum of all properly discounted costs we generate by following it. We define the value function v^\pi : \mathcal{S}\to \mathbb{R} associated with \pi as the total cost of starting at a given state:

\begin{array}{rl} v^\pi(s) = & c(s_0, \pi(s_0)) + \gamma c(s_1, \pi(s_1)) + \gamma^2 c(s_2, \pi(s_2)) + \ldots \\ \textrm{where} & s_0 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array}

Besides its practical interpretation, the discount factor \gamma also plays a significant role from the analytical point of view. If |\gamma| < 1 and the costs are uniformly bounded (which is the case for finite state and action spaces, for example), we can guarantee that the series defining v^\pi converges for any choice of actions and initial state. That is, suppose there exists M > 0 such that

\forall s \in \mathcal{S}, a \in \mathcal{A}(s),\, |c(s, a)| \le M.

This bounds the total cost by a geometric series that cannot blow up,

\sum\limits_{t=0}^\infty \gamma^{t}|c(s_t, a_t)| \le \sum\limits_{t=0}^\infty \gamma^{t} M \le \frac{M}{1 - \gamma},

thus guaranteeing that the value function is well-defined.
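For intuition, here is a naive way to approximate v^\pi numerically: simulate the dynamics for enough steps that the discarded tail, bounded by \gamma^N M / (1 - \gamma), becomes negligible. It assumes a transition table and cost function in the style of the earlier sketch; all names are illustrative.

# Approximate the discounted total cost of following `policy` from state `s`.
function approx_value(policy, transition, cost, s; γ = 0.9, steps = 200)
  total = 0.0
  for t in 0:steps-1
    a = policy(s)
    total += γ^t * cost(s, a)   # discounted immediate cost
    s = transition[(s, a)]      # step the dynamics forward
  end
  return total
end

# Example with the platformer data above: a lazy policy that goes back to idle
# and stays there.
# lazy = Dict(:idle => :stay, :walking => :stop, :jumping => :land, :shooting => :stop)
# approx_value(s -> lazy[s], transition, cost, :walking)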

Optimal Decisions

Having multiple possible courses of action prompts us to ask which one is the best. When programming a robot to escape a labyrinth, you want it to take the least amount of time; when controlling a spaceship towards the moon, it is important to guarantee that it will use the least amount of fuel; when brawling at a bar, you want to knock out your foe while sustaining the least injuries possible. Most of all, the best policy is the one with the least cost taking all time into account — both the present and its future consequences. For example, sometimes a policy that has a higher cost for the first state is overall better because it puts us into a more favorable state. Thus, our problem can be naturally formulated as searching for optimal policies:

Starting at state s, find a policy \pi producing the least total cost over time.

Or equivalently in math language:

\begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_0 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure we’re able to exploit. That is the subject of the next section. But, before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision-making framework.

Example: Shortest Path in a Graph

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process is quite straightforward: the states are the nodes, the actions available at a node are its outgoing edges, the transition follows the chosen edge and the cost is the edge’s weight.

Finding the shortest path from s to z is the same as setting the initial state to s and making z a terminal state of our dynamics.
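As a hedged sketch, here is one way to encode that translation; the cities and distances are, of course, made up.

# States are cities; the actions at a city are its outgoing roads, written as
# (next city, distance) pairs. The destination gets a zero-cost self-loop so
# that it behaves as a terminal state.
roads = Dict(
  :home  => [(:quito, 550.0), (:lima, 830.0)],
  :quito => [(:lima, 280.0)],
  :lima  => [(:cuzco, 580.0)],
  :cuzco => [(:cuzco, 0.0)],
)

actions(s)       = roads[s]
transition(s, a) = a[1]   # follow the chosen road
cost(s, a)       = a[2]   # pay its length

# Without discounting (γ = 1), the optimal value v*(s) of this process is
# exactly the length of the shortest path from s to :cuzco.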

Dynamic Programming

Alright, it’s finally time to start optimizing those decision problems. The simplest idea would be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive, since the number of candidate policies grows exponentially with the number of time steps. Any practical method must take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous Bellman principle of optimality, which is the cornerstone of dynamic programming. Taking Richard E Bellman3’s own words, it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is telling us is that in order to calculate an optimal policy, we should turn this iterative process of taking actions and accumulating costs into a recursive procedure. That is, taking an action a at the initial state s puts us into a new state s' = T(s, a) where we are again faced with the exact same problem of finding an optimal policy, but this time starting at s'. Let’s see how we can exploit this idea.

Remember that we defined the value function v^\pi as the total cost of following a policy \pi when starting at a given state. Let’s define the optimal value function v^\star as the total cost of choosing the best course of action while starting at a certain state s.

\begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_0 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array}

Notice in the optimization problem above that the initial state is only ever used to choose the first action. Later actions do not depend directly on it but only on its consequences. This means that we can break the problem into two parts: calculating an immediate cost dependent only on the initial state and calculating a future cost dependent on all the ensuing states.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & {c(s, a)} + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

There’s already some recursive structure unfolding in here! What is still missing consists of noticing that since the sum in the future cost starts at t = 1, we can factor out \gamma. By renaming l = t-1 we get

\sum\limits_{t=1}^\infty \gamma^{t}c(s_t, a_t) = \gamma \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{l=0}^\infty \gamma^{l}c(s_{l+1}, a_{l+1}),

and applying this in the expression for v^\star,

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=0}^\infty \gamma^{l}c(s_l, a_l) \\ \textrm{s.t.} & s_0 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Although this is a huge expression, it should be straightforward to see that the expression for the future cost is exactly the optimal value v^\star(s') of starting the dynamics at s' = T(s, a). This way, the principle of optimality expresses itself mathematically as a recursive equation that the value of an optimal policy must satisfy.

\boxed{ \begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} }

This is called the Bellman equation and all dynamic programming consists of methods for solving it. Even more: we can think of the Bellman equation as a recursive specification for the decision problems and of dynamic programming as any problem-specific implementation that solves it.

Existence, Uniqueness and Fixed Points

It is time to get deeper into analysis. Whenever mathematicians see a recursive relation such as the Bellman equation, they immediately start asking questions such as: what guarantees do we have about v^\star? Can I trust that it is unique? Does it even exist? Surely we mathematicians may seem a bit too anxious with all these questions, but they are asked for good reason. Besides the guarantee that everything works, in this case proving the existence of a solution also teaches us how to construct it! So pay attention, because in the next section we’re going to adapt the theorems in here into algorithms to solve the Bellman equation.

Recursion has a deep relationship with fixed points, allowing us to use whichever is more practical to our goals. To solve a problem with dynamic programming, our first step will be to write the Bellman equation as the fixed point of an operator \mathcal{B}: (\mathcal{S}\to \mathbb{R}) \to (\mathcal{S}\to \mathbb{R}) called — guess what — the Bellman Operator. It takes value functions to value functions and is defined as

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Now, the optimal value function v^\star is a fixed point of this operator. We thus reduce the question of existence and uniqueness of solutions for the Bellman equation to finding fixed points of \mathcal{B}:

v^* = \mathcal{B}v^*.

If you are not accustomed to fixed points, the transition above from the Bellman equation to the Bellman operator may seem strange. Hence, let’s go through a little story in order to develop some intuition.

Imagine that you are the king/queen of a fantastical kingdom. You are an absolute monarch and whatever action you decide to take, your subjects will follow. Lately, the kingdom’s reserves are running dry and your counselors advised you to define a clear governing policy in order to minimize the kingdom’s spending. Since besides a ruthless ruler you’re also a great mathematician and a fan of this blog, at this point you already know what must be done to save your kingdom from bankruptcy: solve the Bellman equation.

Because at this point of the post you still don’t know how to solve it, you decide to take advantage of your fantastical world and hire a wizard to look into his crystal ball and act as an oracle telling you how much each state of affairs will cost to the kingdom in the future. After God knows how many rituals and incantations, the wizard hands you a shining new value function 🔮 : \mathcal{S}\to \mathbb{R}.

Since it is never wise to blindly follow the advice from a crystal ball, you decide to use it only to predict the future, while relying on your own discernment for taking immediate decisions. In other words, you rule your kingdom by solving the optimization problem

\begin{array}{rl} (\mathcal{B}🔮)(s) = \min\limits_{a} & c(s, a) + \gamma 🔮(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Thus, the process of going from prediction to decision is precisely the Bellman operator. The function \mathcal{B}v is the cost of choosing the best action taking v as your future estimate.

A Useful Theorem

Alright, we have transformed the problem of finding an optimal decision into solving the Bellman equation, and then transformed it again into finding a fixed point to the Bellman operator,

\mathcal{B}v^\star = v^\star.

If this seems as complicated — although more abstract — as where we started, have no fears! We just arrived at a point where we can invoke a powerful theorem from Analysis to solve all our issues at once. Behold the Banach Fixed Point Theorem!

In a complete metric space (M,\mathrm{d}), any continuous f : M \to M that decreases the distance between points:

\mathrm{d}(f(v), f(w)) \le \gamma \mathrm{d}(v, w),\; \textrm{for } \gamma \in [0, 1),

has a unique fixed point v^\star.

Furthermore, one can arrive at v^\star from any initial value v_0 by iterating f:

\lim_{n \to \infty} f^n(v_0) = v^\star,\; \forall v_0 \in M.

This procedure converges linearly with the error at each iteration bounded as

\mathrm{d}(v_n, v^*) \le \frac{\gamma^n}{1 - \gamma} \mathrm{d}(v_0, f(v_0)).

Proving this theorem is out of scope for this post4. However, we can think of it as saying that if a mapping shrinks all distances, then eventually the image of all points will be squeezed into a single point.

What we care the most about this theorem is that it gives us a constructive recipe to find fixed points on metric spaces. All we have to do is to interpret the fixed point equation as an update rule,

v \gets f(v),

and iterate it until the distance between successive iterates falls below a certain tolerance. It is then straightforward to convert the description above into a computational procedure.

function fixed_point(f; v0, tol)
  v = f(v0)
  while distance(v, v0) > tol
    v0 = v
    v  = f(v)  # Update rule
  end
  return v
end

A Metric Space of Value Functions

To apply the Banach fixed point theorem to the Bellman operator, we must find a suitable function space where \mathcal{B} is a contraction. A fitting choice is the space of bounded continuous functions over the states, C^0_b(\mathcal{S}, \mathbb{R}), with distance given by the uniform norm

\mathrm{d}(v, w) = \|v - w\|_\infty = \sup_{s \in \mathcal{S}} |v(s) - w(s)|.

In this (complete) metric space, \mathcal{B} turns out to be a contraction with factor \gamma (yes, the discount factor). Furthermore, for finite state spaces, any function is continuous and bounded, meaning that the above encompasses all algorithms of interest in this post.
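Incidentally, the distance used by the fixed_point procedure earlier is exactly this uniform norm. Assuming the value functions are stored in something indexable by state (a dictionary, say), a one-line sketch would be:

# Uniform (sup) norm of v - w over all states.
distance(v, w) = maximum(abs(v[s] - w[s]) for s in keys(v))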

Since I don’t want to depart too much from the post’s main topic nor dive into mathematical minutiae, we are going to relegate the necessary proofs to an appendix.

Any decision process with discount factor \gamma < 1 has a unique optimal value function v^\star satisfying the Bellman equation

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Furthermore, you can calculate the (not necessarily unique) optimal policy via

\begin{array}{rl} \pi^\star(s) = \argmin\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

It all follows from applying the Banach fixed point theorem to the Bellman operator.

The above result is what I like to call a “bazooka theorem”, because besides guaranteeing the existence and uniqueness of an optimal value function — and consequently an optimal policy — it also teaches us how to calculate it for finite states, as we will shortly see in the ensuing section.

Solving the Bellman Equation

Dynamic Programming consists of solving the Bellman Equation, and, as with all famous equations, there are many possible approaches. Which one to choose will depend on the problem and hardware at hand.

From now on, let’s assume that both the state \mathcal{S} and action \mathcal{A}(s) spaces are finite. This allows us to focus on exhaustive methods exploring the entire state space. There are other methods, such as Reinforcement Learning or Dual Dynamic Programming, which are able to generalize the ideas in here to infinite spaces. But this is a story for another night…

Before jumping into the algorithms, it is worth discussing a couple of technical decisions we must make in order to implement them.

The first thing to notice is that although the most straightforward way to represent functions in a programming language is through computational procedures5, it would be quite inefficient in our case, because repeatedly rebuilding a procedure whenever we improve it is computationally expensive. Thus, for efficiency reasons, it is customary to represent the policy and value function not as functions but using some other data structure. Since our state space is finite, there is a wide range of data structures capable of exactly representing such functions. Common choices are arrays or hash maps, but you can really use anything capable of storing coefficients. Below we see some examples with these types for in-memory storage.

# Storage with vector / array
# WARNING: This requires some method idx : States -> Int for later indexing
function asarray(f :: Function)
  return [f(s) for s in States]
end

# Storage with hash map / dictionary
function asdictionary(f :: Function)
  return Dict(s => f(s) for s in States)
end

The memoization that people associate with dynamic programming lies entirely in this “trick”. However, it is good to keep in mind that this is only a matter of computational representation of functions and is totally orthogonal to any algorithmic design. In a language with first-class functions, it is possible to do dynamic programming using only function composition. It just so happens that it will not be as fast as one would like.

In the algorithms, we will write this choice of representation as two opaque types Values{States} and Policy{States, Actions}, assumed to deal with all boilerplate as needed.

Another important thing to mention is that we will also only interact with a process via its total cost, never touching its constituent parts separately6. Hence, let’s already build a function that does the conversion for us.

# Turn a decision problem into its respective cost function.
function total_cost(p :: Process)
  return (v, s, a) -> p.cost(s, a) + p.γ * v[p.next(s, a)]
end

Notice how we used square brackets to represent that we are accessing a data structure instead of calling a function.

Value Iteration

We thus arrive at our first algorithm: value iteration. Recall from the previous discussion that iterating the Bellman operator over any input converges towards the optimal value function. The algorithm’s main idea follows quite directly from it: convert the Bellman equation into an update rule to find its fixed point.

v \gets \mathcal{B}v.

We can thus start with any initial value function v_0 and iterate the update rule above. By the magic of the Banach Fixed Point theorem, this will converge towards the optimum. This procedure repeats until the uniform error \| v - \mathcal{B}v \|_\infty becomes less than a previously set tolerance.

Each iteration of our algorithm comes from evaluating the Bellman operator in our previously chosen representation. Let’s thus use our total_cost to write it.

# The Bellman operator corresponding to a decision process.
# It uses a storage representation `Values` for the value function.
function bellman_operator(prob :: Process)
  return function(v)
    Bv = Values{States}()  # Empty representation
    for s in States
      Bv[s] = minimum(a -> total_cost(prob)(v, s, a), Actions(s))
    end

    return Bv
  end
end

Finally, the algorithm consists of iterating the procedure generated by bellman_operator until it converges to the fixed point. Afterwards, we calculate the policy as the optimal solution for the determined value function. The name value iteration comes from the fact that only the value function takes part in the updates; the policy is computed just once, at the end.

function value_iteration( prob :: Process     # Data for decision process
                        ; v0 = zeros(States)  # Warm start --- all zeros if you don't know any better
                        , tol)                # Stopping tolerance
  # The optimal value function is the fixed point of the Bellman Operator
  v_opt  = fixed_point(bellman_operator(prob); v0, tol)

  # The optimal policy is the choice of action for the total cost with the optimal value function.
  π_opt  = Policy{States, Actions}()
  for s in States
    π_opt[s] = argmin(a -> total_cost(prob)(v_opt, s, a), Actions(s))
  end

  return π_opt, v_opt
end

The algorithm above comes from directly implementing the Fixed Point Theorem, and, because of this, is guaranteed to converge linearly to the optimum. Each iteration performs one evaluation of the Bellman operator, requiring \mathrm{O}(|\mathcal{S}| \cdot |\mathcal{A}|) operations. Nevertheless, at each iteration, the minimization procedures happen independently for each state, making the evaluation of \mathcal{B}v embarrassingly parallel on the states.
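To illustrate that last point, here is a self-contained sketch of one parallel Bellman sweep using Julia threads. Everything (states, actions, next, cost) is passed in as plain functions and dictionaries instead of the opaque types above, and all names are illustrative.

# One evaluation of the Bellman operator, parallelized over the states.
function bellman_parallel(v::Dict, states, actions, next, cost; γ = 0.9)
  ss  = collect(states)
  out = Vector{Float64}(undef, length(ss))
  Threads.@threads for i in eachindex(ss)   # each state's minimization is independent
    s = ss[i]
    out[i] = minimum(cost(s, a) + γ * v[next(s, a)] for a in actions(s))
  end
  return Dict(zip(ss, out))
end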

In-place Value Iteration

Despite the parallelization opportunities shown by the previous implementation, it can feel too sluggish when implemented sequentially, because it waits until after traversing all states to update the value function. Another approach, better suited for a sequential machine, is to update the value function in-place in order to promptly propagate the improved information to the other states. The trade-off is that this approach is no longer able to broadcast the optimization across many processes in parallel.

Algorithmically speaking, the only required change is rewriting the fixed point iteration procedure to calculate in-place.

function fixed_point_inplace!(f, v; tol)
  maxerr = Inf
  while maxerr > tol
    maxerr = 0  # Reset the error estimate for this sweep
    for s in States
      prev = v[s]
      v[s] = f(v)[s]
      # Estimate ||f(v) - v||_∞ component by component
      maxerr = max(maxerr, abs(v[s] - prev))
    end
  end

  return v
end

function value_iteration_inplace!(prob, v0 = zeros(States) ; tol)
  # Reuse the Bellman operator, but evaluate and store it state by state.
  v_opt  = fixed_point_inplace!(bellman_operator(prob), v0 ; tol)

  π_opt  = Policy{States, Actions}()
  for s in States
    π_opt[s] = argmin(a -> total_cost(prob)(v_opt, s, a), Actions(s))
  end

  return π_opt, v_opt
end

In the animation below, we can see in-place value iteration in action for the problem of escaping from a maze. In this model, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). The objective is to reach the bottom-right corner in the minimum amount of steps possible. We do so by starting with a uniformly zero value and a random policy. On the left we see the value function at each iteration and on the right the associated policy.

The implementations above are, indeed, just variations on the same idea: iterating the Bellman operator in order to converge to the optimal value function. There are many other tweaks we could make to it which, nevertheless, don’t affect the algorithm’s essence: choosing good warm starts, parallelizing, changing the state traversal order in the in-place version, etc. The best approach tends to be problem-dependent.

Policy Iteration

One issue with value iteration is that all policy calculations are implicit, since we just work with value functions. Therefore, it is possible to reach an optimal policy but keep iterating the algorithm because the value function has not converged yet. In this section, we will get acquainted with policy iteration, an algorithm that uses policies to calculate value functions and value functions to calculate policies until it converges towards the optimum. Its selling point is being able to directly calculate an optimal policy in a finite number of steps.

Policy Evaluation

Let’s say somebody gave you a policy \pi and told you nothing more about it. How can you calculate the value function v^\pi associated with it? One way is to notice that it satisfies a recursion similar to the Bellman equation, but without the minimization step.

\begin{array}{rl} v^\pi(s) = & c(s, \pi(s)) + \gamma v^\pi(s') \\ & \quad\textrm{where}\; s' = T(s, \pi(s)). \end{array}

We can also transform this equation into a fixed point problem by defining an operator

\begin{array}{rl} (\mathcal{B}^\pi v)(s) = & c(s, \pi(s)) + \gamma v(s') \\ & \quad\textrm{where}\; s' = T(s, \pi(s)). \end{array}

The above can be readily written as a computational procedure:

function policy_bellman_operator(prob :: Process, pi :: Policy)
  return function(v)
    Bv = Values{States}()  # Empty representation
    for s in States
      Bv[s] = total_cost(prob)(v, s, pi[s])
    end
    return Bv
  end
end

Now, we can look at \mathcal{B}^\pi as the Bellman operator of a decision process where the action set at each state was collapsed into the single option prescribed by \pi. Hence, we know that, under the same assumptions as before, it has a unique fixed point.

Moreover, turning it into an update procedure converges towards v^\pi for any initial value function. This way, we arrive at an algorithm for evaluating the cost of a policy, unimaginatively called policy evaluation.

function policy_evaluation(prob :: Process
                          , π   :: Policy
                          ; v0 = zeros(States)
                          , tol)
  return fixed_point(policy_bellman_operator(prob, π); v0, tol)
end

Notice the similarity with value iteration. The only true difference is in which operator we pass to the fixed point procedure: instead of choosing an optimal action, it just follows along with the policy. Moreover, all the discussion above on the variations of value iteration also holds for policy evaluation. You can make the same modifications with the same effects.

Policy Improvement

After we know a policy’s value function, our next question is how to update it into a better policy. That is, how can we use this information to get nearer to the optimum?

Remember from the previous discussions that applying the Bellman operator \mathcal{B} to any non-optimal value function produces a strictly better result. It can, therefore, improve a policy’s value function.

(\mathcal{B}v^\pi)(s) = \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v^\pi(T(s, a)) \le c(s, \pi(s)) + \gamma v^\pi(T(s, \pi(s))) = v^\pi(s).

The value function \mathcal{B}v^\pi encodes the cost of choosing the best action right now while following \pi on all future steps. We can get a policy from it by taking the solution to the optimization problem.

\begin{array}{rl} \pi'(s) = \argmin\limits_{a} & c(s, a) + \gamma v^\pi(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Unless \pi was already optimal, the equation above generates a strictly better policy. From it we can define a procedure, called policy_improvement, which turns a value function v into a policy \pi' that is better than whatever policy v represented.

function policy_improvement(prob :: Process, v :: Values)
  π = Policy{States, Actions}()
  for s in States
    π[s] = argmin(a -> total_cost(prob)(v, s, a), Actions(s))
  end
  return π
end

Alternating Evaluation and Improvement

By starting from any random policy \pi_0, and alternatively running policy evaluation and improvement, we generate a sequence of policies and value functions

\pi_0 \xrightarrow{\textrm{evaluation}} v^{\pi_0} \xrightarrow{\textrm{improvement}} \pi_1 \xrightarrow{\textrm{evaluation}} v^{\pi_1} \xrightarrow{\textrm{improvement}} \ldots

The value functions are monotonically decreasing while the policies strictly improve until converging to the optimum.

We thus arrive at another dynamic programming algorithm: policy iteration. It consists of iteratively taking a policy \pi, finding its value function v^\pi through policy evaluation and finally using policy improvement to arrive at a better policy. Since there are only finitely many policies, and we always obtain a strictly better policy, this algorithm is guaranteed to converge to an optimal policy in a finite amount of steps.

function policy_iteration(prob :: Process
                         ; v0 = zeros(States)
                         , π0 = rand(Policy{States, Actions})
                         , tol)
  v = policy_evaluation(prob, π0 ; v0 = v0, tol = tol)
  π = policy_improvement(prob, v)

  while π != π0
    π0 = π
    # Use previous v as warm start
    v = policy_evaluation(prob, π ; v0 = v, tol = tol)
    π = policy_improvement(prob, v)
  end
  return π, v
end

Just like value iteration, policy iteration also accepts many variations on how we traverse the states. The implementation above is close to the theory and is embarrassingly parallel on both the evaluation and the improvement step. Nevertheless, it is useful to think of policy iteration more as an algorithmic principle than as an algorithm itself and adapt the steps to consider any problem specific information that may be available.

Backward Induction over a Finite Horizon

Until now, we haven’t assumed anything about the decision process beyond finiteness. By exploiting the problem’s state space structure, we can turn these algorithms into much faster ones. In this section we deal with finite horizon problems and show that, for them, we can make value iteration converge in a single pass over the states!

When the dynamics ends after a certain number N of time steps, we say that the problem has a finite horizon. In this case, the stage t is part of the state variable (because how else would we know when to stop?), and there is a terminal state \blacksquare with zero cost representing anything that happens after the dynamics end. Essentially, we work for N stages and then reach \blacksquare, where we can just relax and do nothing for the rest of eternity.

If you prefer equations to figures, a finite horizon state machine is one where the transition function looks like this:

\bar{T}((t, s),\, a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases}

But what is so special about finite horizons? After all, the equation above seems much more confusing than what we had before. Well… what we gain is that the state space \mathcal{S} is clustered into smaller spaces that are visited in sequence: \mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N, \mathcal{S}_{N+1} = \{\blacksquare\}.
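In code, the augmented transition \bar{T} would look something like the sketch below; the helper name and the :terminal marker are illustrative stand-ins for \blacksquare.

# Wrap a transition function T into its finite horizon version: states become
# (stage, state) pairs and everything past stage N collapses into a terminal marker.
function finite_horizon(T, N)
  function Tbar(ts, a)
    t, s = ts
    return t < N ? (t + 1, T(s, a)) : :terminal
  end
  return Tbar
end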

The backward induction algorithm consists of value iteration exploiting the structure we just discussed. At the terminal state, the cost is always zero, so we can set v(\blacksquare) = 0. But this means that the future cost for all states in \mathcal{S}_N is fixed, and the optimal policy for them is just to choose the cheapest action. Now that we know which action to choose in \mathcal{S}_N, the future is fixed for all states in \mathcal{S}_{N-1}, reducing the problem for them to the routine we just did. We repeat this procedure until we reach the first stage. This calculates an optimal policy by induction, but moving “backwards” in time.

In Julia code, the algorithm looks like this:

function backward_induction(prob :: Process)
  v  = Values{States}()  # The value at the terminal state ■ is taken to be zero
  π  = Policy{States, Actions}()
  for t in N:-1:1        # Traverse the stages backwards in time
    for s in States(t)
      v[s] = minimum(a -> total_cost(prob)(v, s, a), Actions(s))
      π[s] = argmin(a -> total_cost(prob)(v, s, a), Actions(s))
    end
  end
  return π, v
end

I recommend comparing this with the generic value iteration in order to see what we gain. One thing should be clear: backward induction does the exact same operation as value iteration for each state, but only requires a single pass over the state space. This makes it much more efficient. Another more subtle detail is that since we have more structure to exploit in the dynamics, the algorithm doesn’t have to assume as much about the Bellman operator. For example, the (admittedly hand-wavy) argument above proves convergence without any dependence on the Banach fixed-point theorem. Thus, as a bonus, backward induction works for any discount factor \gamma.7

Nondeterminism: One Cannot Know It All

Ok, it’s time to recapitulate a bit. We started our journey looking at automata and controllable dynamics, showing how we can represent the best set of actions as those that fulfill a certain recursive relation on their value functions, called the Bellman equation. Then, we explored a couple of ways of solving this equation through methods for finding fixed points. A detail that stands out in all of that, though, is that we always had perfect information about our systems.

In the real world, there are no guarantees that taking an action will put you in a determinate state. In general, there is a myriad of possible states — perhaps all of them — that the system could transition to. Let’s model this by modifying the transition function’s type to

T : (s : \mathcal{S}) \times \mathcal{A}(s) \to M \mathcal{S}

where M converts a type to some uncertain context.8 Common examples of such contexts are the power set \mathcal{P}(\mathcal{S}), when the transition may land in any of several states, and probability distributions over \mathcal{S}, when the transition is stochastic.

To deal with nondeterministic transitions, we need some way to take a value function and aggregate it over all possible costs for the uncertain future into a single real number,

\rho : (\mathcal{S}\to \mathbb{R}) \times M \mathcal{S}\to \mathbb{R}.

The function \rho depends on which uncertainty context we’re dealing with, and we will shortly see some examples of it. Using it, we can define a Bellman operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma \rho(v, s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

Quite similar to what we had before, don’t you think? Under some mild conditions on \rho, all derivations in the appendix for existence and uniqueness of solutions still work in this context. Moreover, all our algorithms — value iteration, policy iteration, backward induction — only used the system’s total cost, which is still a deterministic function. Thus we can apply them to this context without any need for modification. How cool is that?
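In code, the only change is that the total cost gains an aggregator \rho. A sketch, reusing the Process fields from before:

# Total cost of a nondeterministic process: ρ collapses the uncertain next
# state into a single number. With ρ(v, s') = v[s'] we recover the
# deterministic case from before.
function total_cost_nondet(p :: Process, ρ)
  return (v, s, a) -> p.cost(s, a) + p.γ * ρ(v, p.next(s, a))
end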

Example: Recurrence Equations and Fibonacci

What would a dynamic programming tutorial be without calculating the good ol’ Fibonacci numbers, right? This example is actually pretty simple and using the full power we’ve developed is real overkill. Nevertheless, as it is a rather common first example to encounter when learning dynamic programming in Computer Science classes, it is worth viewing this kind of problem in the light of the formalism we’ve developed.

A recurrence relation is any function \mathbb{N} \to \mathbb{R}, where the nth term is recursively defined through the previous terms.

f(n) = \begin{cases} c_n, & n < k \\ g(n, f(n-1), \ldots, f(n-k)), & n \ge k. \end{cases}

The c_n are constants corresponding to the base cases of the recursion, while the other values f(n) depend directly on the previous k terms.

Famous examples with this structure are the factorial and Fibonacci functions:

\begin{align*} \mathrm{fat}(n) &= \begin{cases} 1, & n = 0 \\ n \cdot \mathrm{fat}(n-1), & n \ge 1. \end{cases} \\ \mathrm{fib}(n) &= \begin{cases} 0, & n = 0 \\ 1, & n = 1 \\ \mathrm{fib}(n-1) + \mathrm{fib}(n-2), & n \ge 2. \end{cases} \\ \end{align*}

In general, evaluating f(n) directly via the definition may be exponentially slow. But dynamic programming allows you to calculate it in time linear in n.

The idea is to define a nondeterministic decision process whose Bellman equation is exactly the recurrence relation. Then, the optimal value function will equal f for all indices up to the N we want to evaluate. The states are the numbers 0,\ldots, N and there is only a single dummy action \blacklozenge. For the transition, we follow the recurrence’s indices, using the power set as our source of nondeterminism. Notice that this is a finite horizon process.

T(s, \blacklozenge) = \begin{cases} \emptyset, & s < k \\ \{s-1, s-2, \ldots, s-k\}, & s \ge k. \end{cases}

For the immediate cost, we use the base cases c_n for the first k stages while the others are all zero.

c(s, \blacklozenge) = \begin{cases} c_s, & s < k \\ 0, & s \ge k. \end{cases}

And to aggregate over the set of future indices, what a better choice than using the relation g itself?

\rho(v, s') = g(s, \{ v(n) \mid n \in s' \}).

With this setup, this system’s Bellman equation looks a lot like the original recurrence.

v(s) = \min_{a \in \{\blacklozenge\}} \begin{cases} c_s, & s < k \\ g(s, v(s-1), \ldots, v(s-k)), & s \ge k. \end{cases}

Since there is only a single action, the minimization is redundant and the fixed point satisfies the original recurrence. Because the problem at hand has a finite horizon, we can solve it via Value Iteration or Backwards Induction even without a discount factor \gamma.

As an example, let’s calculate the first 15 Fibonacci numbers. The following video shows the steps for value iteration.

Since the horizon is finite, we can improve it even more by using Backwards Induction! The process we’ve just constructed has a single state per stage, where we consider the initial state to be N and the final one 0 (which means going backwards, in this case). We can, therefore, use backwards induction to evaluate the Fibonacci recurrence in exactly N steps.
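Stripped of all the machinery, the whole construction collapses into the familiar bottom-up table fill. A minimal sketch, assuming we only want the N-th number:

function fib_dp(N)
  v = zeros(Int, N + 1)                     # value function, indexed by the states 0..N
  for s in 0:N                              # stage order forced by the recurrence
    v[s + 1] = s < 2 ? s : v[s] + v[s - 1]  # the Bellman equation is the recurrence itself
  end
  return v[N + 1]
end

# fib_dp(15) == 610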

Stochastic Dynamic Programming

Being a nice applied field, dynamic programming has kept stochasticity in mind since its inception. The real world is full of uncertainty, and what better way to model not knowing the future than probability itself?

To make an automaton stochastic, the transition must return a probability distribution over the states. In the literature, you will find these systems under the name Markov Decision Process (MDP for short). The name comes from the fact that the new state only depends on the current state and action, being independent of the process’ history; just like a Markov chain, which is an MDP with a single action. A useful intuition for this kind of process is the interaction between an actor and an environment. At each time step, the environment is at a state s (which is known to the actor), and the actor may choose among different actions a \in \mathcal{A}(s), each one incurring a certain cost, to interact with it. This action affects the environment in some way that is out of the actor’s reach (thus stochastic / non-deterministic), changing its state to s' = T(s, a). Before this transition happens, we can only estimate what s' will be, with uncertainty. This is illustrated in the diagram below.9

Allowing stochasticity opens the way for modeling a lot more cool situations. For example, robots that play video games! The states may be the internal state of the game or some partial observation of them that is available to the player and the actions are the buttons on the controller. The transitions are internal to the game and the costs are related to some winning/losing factor. Have you ever heard of Volodymyr Mnih et al.10’s Playing Atari with Deep Reinforcement Learning paper? In it, they use reinforcement learning to train a robot capable of playing Atari 2600 games and all modeling is done via Markov decision processes in a really similar way to the discussion in this paragraph. I really recommend checking it out.

Since we cannot know the future, the usual tool is to take all possible outcomes and average their cost. This gives us an aggregating function to extract a deterministic value from a stochastic transition.11

\rho(v, S) = \mathbb{E}[v(S)].

Stochastic dynamic programming is the field dedicated to optimizing the expected total cost of a Markov decision process over all possible policies.

v^\star(s) = \min_\pi v^\pi(s) = \min_{a_t} \mathbb{E}\left[ \sum\limits_{t=0}^\infty \gamma^{t}c(s_t, a_t) \,\middle|\, s_0 = s,\, s_{t+1} = T(s_t, a_t) \right]

As you probably expected, the above is equivalent to a Bellman equation. Since the expectation is linear, the derivation is rather similar to the one we did for deterministic processes.

\begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma \mathbb{E}\left[v^\star(s') \right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array}

For finite state spaces, this equation has the same contraction properties as the deterministic one. Hence, the tools of value and policy iteration are also readily available to solve MDPs. Their computational complexity is worse, though, because calculating the mean makes the update rule v \gets \mathcal{B}v require \mathrm{O}(|\mathcal{S}|^2 \cdot |\mathcal{A}|) operations.
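For a finite distribution over next states, the aggregator is just a weighted average. A sketch, representing the distribution as (state, probability) pairs, that plugs directly into the nondeterministic total cost from before:

# Expectation of the value function under a finite distribution of next states.
expectation(v, dist) = sum(p * v[s] for (s, p) in dist)

# e.g. total_cost_nondet(prob, expectation) gives the cost inside the MDP
# Bellman operator above.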

End of our Journey

Well, we finally reached the end of our overview of dynamic programming. I sincerely hope it was as fun for you to read as it was for me to write. And that DP gets the place of honor it deserves in your problem solving toolkit!12

Of course, a single blog post is too tiny to encompass a subject as vast as DP. There are matters about estimating the value function instead of simply calculating it, infinite state spaces, continuous time and a plethora of cool stuff we can do. There are also a lot of connections with reinforcement learning, for which we only scratched the surface in this post. Unfortunately, these will remain as stories for another night.

Farewell and see you next time!

Acknowledgements

This post came to life after a series of conversations I had with Pedro Xavier. The good thing about explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also indebted to Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

Appendix (Convergence in Infinite Horizon)

In this appendix we show that the Bellman Operator

\begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s) \end{array}

is a monotone contraction over the space of bounded continuous functions.

We begin our proof with monotonicity. For that, let’s introduce a partial order on the space of value functions \mathcal{S}\to \mathbb{R} given by uniform ordering on all states,

v \le w \iff \forall s \in \mathcal{S},\, v(s) \le w(s).

The Bellman Operator preserves uniform ordering of value functions:

v \le w \implies \mathcal{B}v \le \mathcal{B}w.

The hypothesis v \le w implies for any state s and action a that

c(s, a) + \gamma v(T(s,a)) \le c(s, a) + \gamma w(T(s,a)).

Since this is valid for any a, taking the minimum on both sides preserves the inequality.

\min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma w(T(s, a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s).

The line above is valid for all states, thus concluding the proof.

Another important property of \mathcal{B} is that uniform translations of the input v also translate the output uniformly.

For any constant k, \mathcal{B}(v + k) = \mathcal{B}v + \gamma k.

\begin{array}{rlll} \mathcal{B}(v + k)(s) &= &\min\limits_{a} & c(s, a) + \gamma (v(s') + k) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &=& \min\limits_{a} & c(s, a) + \gamma v(s') + \gamma k\\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array}

Since the term \gamma k does not depend on the action a, we may take it out of the optimization,

\mathcal{B}(v + k)(s) = \mathcal{B}(v)(s) + \gamma k.

This concludes the proof.

Finally, let’s prove that the Bellman operator contracts the space of bounded continuous functions by the discount factor.

The Bellman operator contracts the space of continuous bounded functions with Lipschitz constant \gamma,

\|\mathcal{B}v - \mathcal{B}w\|_\infty \le \gamma \|v - w\|_\infty.

When \gamma < 1, it is a contraction.

From the definition of the uniform norm, we get that for any state s,

v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty.

From the monotonicity we just proved, applying \mathcal{B} to both sides preserves this inequality:

(\mathcal{B}v)(s) \le \mathcal{B}(w + \|v - w\|_\infty)(s).

And since the right-hand side above has a uniform translation, we can take the constant out:

\begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned}

Using that the norm is symmetric, we can do the same derivation in the opposite direction (for w - v) to get an inequality for the absolute value. Finally, applying the supremum, it becomes the result we want.

\begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned}

Finally, from the Banach fixed point theorem and the above, we conclude that whenever \gamma < 1, the operator \mathcal{B} has a unique fixed point. Hence, any decision process with a discount factor is solvable and has a unique optimal value function v^\star.


  1. To be more precise, we work with hydrothermal dispatch problems, where one must decide between many sources of energy (hydro, thermal, renewable) to supply a certain power demand taking into account the uncertainties of the future. For example: hydro is cheap and clean, but you risk running out of water if you use all of it and next month turns out particularly dry. Finding the best energy dispatch is once again solved via dynamic programming.↩︎

  2. Even Richard Bellman admittedly named it based on how cool it sounds.↩︎

  3. Dynamic Programming, Princeton Landmarks in Mathematics and Physics (Princeton, NJ: Princeton University Press, 1957), ch 3, p. 83.↩︎

  4. It is out of scope more because it is a tangent to the topic than because of any difficulty in the proof. If you are interested in analysis, I really recommend you to try proving it. The main idea is using the contraction property to show that the distance between the iterations of f must converge towards zero.↩︎

  5. They are even called functions in most programming languages.↩︎

  6. You may wonder if this means that dynamic programming also works for decision processes whose costs satisfy other kinds of Bellman equations (non-additive costs, for example) and the answer is yes! A great place to start is Dimitri P. Bertsekas, Abstract Dynamic Programming (Belmont, Mass: Athena Scientific, 2013)’s book. Just be warned that there are a lot more technicalities to deal with in those cases.↩︎

  7. In contrast with value iteration, it also has no dependence on the costs being real numbers. In this case, any closed Semiring would do. But I digress… This is out of scope for this post.↩︎

  8. The technical definition is that M should be a Monad. But discussing those details is out of scope here. See this other post on automata with contexts for a discussion of this structure in a similar context.↩︎

  9. Loosely based on the diagram found in Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second edition, Adaptive Computation and Machine Learning Series (Cambridge, Massachusetts: The MIT Press, 2018).↩︎

  10. “Playing Atari with Deep Reinforcement Learning,” December 2013, https://doi.org/10.48550/ARXIV.1312.5602.↩︎

  11. Although the average is the most common choice, it is far from the only one. Many real-life situations call for risk-aversion and dynamic programming with coherent risk measures works equally well.↩︎

  12. In fact, fixed points and recursion as a whole deserve the spot. They’re everywhere!↩︎
