An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

I gotta say that I was taught dynamic programming in different
contexts, but it took me a while to finally get the “click” that they
were actually the same thing. When learning algorithms and data
structures, it was a memoization-based technique where you speed up your
algorithm by first solving the easier parts and saving them for later
use. At work, I mostly deal with solving a lot of linear programs for
long-term scheduling problems.^{1} The main algorithm we
use, called *Stochastic Dual Dynamic Programming*, at first
didn’t seem much like the programming technique from algorithms class.
Finally, one of the main methods for model-based reinforcement learning
is again called dynamic programming, and it also didn’t seem so much
like the other instances.

So, what’s happening here? Did everybody choose to call their
algorithms dynamic programming just because it’s a cool name?^{2} Well, in fact there are some
principles that apply to all of those instances, from planning a
rocket’s trajectory to TeX’s word-wrapping. And the list
goes on and on.

I want to invite you to a journey through many realms of mathematics. We will range from automata to optimal control, passing through Markov chains, dynamical systems, linear programming and even metric spaces. Take your seat and enjoy the ride!

Before delving into dynamic programming per se, we first have to establish a few concepts. After all, it’s always best to know which problems you intend to solve before learning a method to solve them, right?

As a matter of motivation, let’s start with something I am really fond of: old school platformer games. In our hypothetical game, which is definitely not about some Italian plumber, the character stands idle by default. But with the press of a button on the controller, the player may command the character to do a few things: shoot, jump or walk. And, of course, each of these actions activates the respective animation on the screen. In the best Resident Evil style, this game only allows the character to shoot while idle and also forces you to first be idle after a jump before doing any other action. Think of that as the time it takes to restore one’s balance after falling. This description may seem overly complicated in text, but fortunately the nice folks in the Comp Sci department already invented diagrams that show these transitions nicely.

```
digraph "Plataformer State Machine" {
rankdir=LR;
size="8,5"
node [shape = circle
style = "solid,filled"
width = 0.7
color = black
fixedsize = shape
fillcolor = "#B3FFB3"];
A [label = "shooting"];
J [label = "jumping"];
W [label = "walking"];
I [label = "idle"];
I -> I [label = "do nothing"];
W -> W [label = "keep walking"];
I -> A [label = "attack"];
A -> I [label = "finish attack"];
I -> W [label = "walk"];
J -> I [label = "hit ground"];
W -> I [label = "stop"];
W -> J [label = "jump"];
I -> J [label = "jump"];
}
```

Our modeling above is an instance of something called a *state
machine*, or *automaton* if you’re into Greek words. There are
4 states in which the character can be, and at each one there is an
available set of actions to take that may transition it to another
state. More abstractly, an automaton is a system that can be in one of
many *states* *s* ∈ 𝒮 and
at each state, you can choose among a set of *actions* *a* ∈ 𝒜(*s*) that transition
the system to a new state *s*′ = *T*(*s*,*a*),
where *T* is called the system’s
*transition function*.
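Concretely, a finite automaton like the platformer above fits in a lookup table. Here is a minimal Python sketch; the state and action names come from the diagram, everything else is illustrative:

```python
# The platformer automaton as a lookup table; state and action names are
# taken from the diagram above, the rest is illustrative.
TRANSITIONS = {
    ("idle",     "do nothing"):    "idle",
    ("idle",     "attack"):        "shooting",
    ("idle",     "walk"):          "walking",
    ("idle",     "jump"):          "jumping",
    ("shooting", "finish attack"): "idle",
    ("walking",  "keep walking"):  "walking",
    ("walking",  "stop"):          "idle",
    ("walking",  "jump"):          "jumping",
    ("jumping",  "hit ground"):    "idle",
}

def actions(s):
    """The set A(s) of actions available at state s."""
    return {a for (state, a) in TRANSITIONS if state == s}

def T(s, a):
    """The transition function s' = T(s, a)."""
    return TRANSITIONS[(s, a)]
```

Notice how the table encodes the Resident Evil rule: `actions("shooting")` contains only `"finish attack"`, forcing the character back to idle.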

An important notice: If you come from Computer Science, you are
probably most used to *finite* state machines. Just know that in
our context, the states may be any set. Some of the algorithms that we
will see today only work for finite state spaces, but there are others
that may even require a continuous space! An example is SDDP, which uses
linear programming duality and thus requires the state space to be a
convex subset of ℝ^{n}.

Iterating the transition *T*
establishes a dynamics for our system where we start at an initial state
*s* and by taking a sequence of
actions *a*_{1}, *a*_{2}, …
we walk over the state space.

$$ \begin{aligned} s_1 &= s, \\ s_{t+1} &= T(s_t, a_t). \end{aligned} $$

This new view makes our state machine somewhat equivalent to a controllable dynamical system, which is another really cool name in my opinion.
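In code, this dynamics is just a fold over the sequence of actions. A minimal Python sketch, with a made-up transition function on integers:

```python
def rollout(T, s, actions):
    """Iterate s_{t+1} = T(s_t, a_t), returning the visited states."""
    states = [s]
    for a in actions:
        s = T(s, a)
        states.append(s)
    return states

# Toy dynamics for illustration: "inc" adds one, "double" multiplies by two.
T = lambda s, a: s + 1 if a == "inc" else 2 * s
rollout(T, 1, ["inc", "double", "inc"])  # visits 1, 2, 4, 5
```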

As an example, think of a game of Sudoku. The states are the (partially numbered) boards and the actions consist of putting a valid number in a certain coordinate. You start with some random board and repeatedly place numbers until you reach a terminal state where there are no available actions.

One can argue that a state encapsulates all you must know about your
system in order to choose an action, no matter the previous history nor
time step. Indeed, if those things affect your choice, then you can
without loss of generality model the problem as a larger system where
the state also carries this information. Thus controlling a dynamic
system amounts to a function *π* : (*s*:𝒮) → 𝒜(*s*)
which chooses a valid action for each state. In the literature this is
called a *policy*, in analogy to a government taking actions to
run its state.

Unfortunately life is not known for its free lunches and in most
systems, whenever we take action *a* at state *s*, there is a certain cost *c*(*s*,*a*) to pay.
Depending on the context this can be, for example, a real monetary cost
(for economic problems), some total distance (for planning) or even a
negative cost representing a reward.

Thus, by following a policy *π*, we produce a sequence of costs
*c*(*s*_{t},*π*(*s*_{t})),
one for each time step. We could define the total cost for *π* as the sum of those costs, but
there is an additional detail to notice. If I gave you something and
asked whether you want to pay me today or next year, which option would
you prefer? Sometimes there are factors such as inflation or interest
rates that make costs in the future not have the same actual value as the
costs we expend right now. This prompts us to introduce a problem
dependent *discount factor* *γ* ∈ [0,1] such that the total cost
for a policy *π* is

$$ \begin{array}{rl} v^\pi(s) = & c(s_1, \pi(s_1)) + \gamma c(s_2, \pi(s_2)) + \gamma^2 c(s_3, \pi(s_3)) + \ldots \\ \textrm{where} & s_1 = s, \\ & s_{t+1} = T(s_t, \pi(s_t)), \\ \end{array} $$

The equation above defines the *value function* *v*^{π} : 𝒮 → ℝ for a
given policy *π*.
**Spoiler**: keep an eye on the *v*^{π}, because
later in this post we will find them to be useful tools closely related
to the memoization techniques that people usually identify with dynamic
programming.

Besides its practical interpretation, the discount factor *γ* also plays a significant role from
the mathematical point of view. If |*γ*| < 1 and the costs are
uniformly bounded (which is the case for a finite action space, for
example) we can guarantee that the series defining *v*^{π} converges for
any policy and initial state. That is, suppose that there exists *M* > 0 such that

∀*s* ∈ 𝒮, *a* ∈ 𝒜(*s*), |*c*(*s*,*a*)| ≤ *M*.

This bounds the total cost by a geometric series that cannot blow up,

$$ \sum\limits_{t=1}^\infty \gamma^{t-1}|c(s_t, \pi(s_t))| \le \sum\limits_{t=1}^\infty \gamma^{t-1} M \le \frac{M}{1 - \gamma}, $$

thus guaranteeing that the value function is well-defined.
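A quick numerical sanity check of this bound, with illustrative numbers *M* = 3 and *γ* = 0.9: every truncated total cost stays below *M*/(1 − *γ*) = 30.

```python
# Numerical check of the geometric bound, using made-up values:
# worst-case stage cost M = 3 and discount factor γ = 0.9.
M, gamma = 3.0, 0.9
partial = sum(gamma**(t - 1) * M for t in range(1, 201))
bound = M / (1 - gamma)
assert partial <= bound  # every partial sum stays below M / (1 - γ) = 30
```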

Having multiple possible courses of action prompts us to ask which one
is the best. When programming a robot to escape a labyrinth,
you want it to take the least amount of time. When controlling a
spaceship towards the moon, it is important to guarantee that it will
use the least amount of fuel. When brawling at a bar, you want to knock
out your foe with the fewest injuries possible. Above all, the best
policy is the one with the least cost taking *all time* into
account; both the present and its future consequences. For example,
sometimes a policy that has a higher cost for the first state is overall
better because it puts us into a more favorable state. Thus, our problem
can be naturally formulated as searching for *optimal
policies*:

Starting at state *s*, find a policy *π* producing the least total cost over time.

Or equivalently in math language:

$$ \begin{array}{rl} \min\limits_\pi v^\pi(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array} $$

Right now, this may seem like a big and scary optimization problem but in fact it contains a lot of structure that we can exploit in order to solve it. This will be the subject of the next section. Before we continue, let’s go over a little tangent on how to formulate some classical problems in this decision making framework.

Suppose you are at your hometown and just received a message from a friend telling you that there are singing llamas in Cuzco, Peru, right now. This makes you at the same time incredulous and curious, so you just pick your favorite bike and get on the road towards Cuzco. Unfortunately there are no direct bikeways connecting your home to Cuzco, meaning that you will have to find a route going through other cities. Also, there is a risk that the llamas will stop to sing at any time and just go back to their usual behavior of eating grass throughout the mountains. This prompts you to decide to take the shortest possible path to Cuzco.

The above description is an instance of finding the shortest path in a graph. In it, we represent each city by a graph node and direct routes between two cities as a weighted edge where the weight is the distance. Going from home to Cuzco amounts to finding the path between those two nodes with the smallest total distance.

The translation from this graph description to a decision process description is quite straightforward.

- **States**: nodes in the graph.
- **Actions** at state *s*: edges going from *s* to another node.
- **Transition**: the opposite node on the same edge. That is, given an edge *s* → *s*′, *T*(*s*, *s*→*s*′) = *s*′.
- **Costs**: *c*(*s*,*a*) is the weight of edge *a*.

Finding the shortest path from *s* to node *z* is the same as setting the initial
state to *s* and making *z* a terminal state of our
dynamics.
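To make the mapping concrete, here is a Python sketch with a made-up miniature road map (city names and distances are purely illustrative). As a preview of the memoization to come, the least total distance already satisfies a recursion:

```python
from functools import lru_cache

# Made-up miniature road map: GRAPH[s][s'] is the distance of the direct
# route s -> s'. Distances are purely illustrative.
GRAPH = {
    "home":   {"La Paz": 7, "Lima": 9},
    "La Paz": {"Cuzco": 3},
    "Lima":   {"Cuzco": 2},
    "Cuzco":  {},  # terminal state: no available actions
}

def actions(s):
    """A(s): the edges leaving s, identified by their other endpoint."""
    return GRAPH[s].keys()

def T(s, a):
    return a  # taking the edge s -> s' lands us on s'

def c(s, a):
    return GRAPH[s][a]  # the cost of an edge is its weight

@lru_cache(maxsize=None)
def v(s):
    """Least total distance from s to Cuzco (γ = 1, acyclic toy graph)."""
    if s == "Cuzco":
        return 0
    return min(c(s, a) + v(T(s, a)) for a in actions(s))
```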

Alright, it’s finally time to solve those decision problems. The simplest idea could be to exhaustively search the space of all actions trying to find the best solution. Notice that even for finite states and horizon, this may be prohibitively expensive since the possible candidates grow exponentially with the time steps. Any practical method will take into account how this class of problems naturally breaks apart into separate stages.

Our approach will involve the famous *Bellman principle of
optimality*, which is the cornerstone of dynamic programming. In
Bellman’s own words [@bellmanDPBook, ch. 3, p. 83], it reads as:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Alright, what does this mean? What the principle of optimality is
telling us is that in order to calculate an optimal policy, we should
turn this iterative process of taking actions and accumulating costs into
a recursive procedure. That is, taking an action puts us into a new
state *s*_{2} = *T*(*s*_{1},*a*_{1})
where we are again faced with the exact same problem of finding an
optimal policy but this time starting at *s*_{2}. Let’s see how we can
exploit this idea.

Remember that we defined the value function *v*^{π} as the total
cost of following a policy *π*
when starting at a given state. Let’s define the *optimal value
function* *v*^{⋆} as
the total cost of choosing the best course of action while starting at a
certain state *s*.

$$ \begin{array}{rl} v^\star(s) = \min\limits_{a_t} & \sum\limits_{t=1}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_1 = s, \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t). \end{array} $$

Notice in the optimization problem above that the initial state is
only ever used to choose the first action. Later actions do not depend
directly on it but only on its consequences. This means that we can
break the problem into two parts: calculating an *immediate cost*
dependent on the initial state and calculating a future cost dependent
on all next states.

$$ \begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \left( \begin{array}{rl} \min\limits_{a_t} & \sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) \\ \textrm{s.t.} & s_2 = s', \\ & s_{t+1} = T(s_t, a_t), \\ & a_t \in \mathcal{A}(s_t) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} $$

There’s already some recursive structure unfolding in here! What is
still missing is noticing that since the sum in the future cost
starts at *t* = 2, we can factor
out *γ*. By renaming *l* = *t* − 1 and relabeling the trajectory accordingly, we get

$$ \sum\limits_{t=2}^\infty \gamma^{t-1}c(s_t, a_t) = \gamma \sum\limits_{t=2}^\infty \gamma^{t-2}c(s_t, a_t) = \gamma \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l), $$

and applying this in the expression for *v*^{⋆},

$$ \begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma\left( \begin{array}{rl} \min\limits_{a_l} & \sum\limits_{l=1}^\infty \gamma^{l-1}c(s_l, a_l) \\ \textrm{s.t.} & s_1 = s', \\ & s_{l+1} = T(s_l, a_l), \\ & a_l \in \mathcal{A}(s_l) \end{array} \right) \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} $$

Although this is a huge expression, it should be straightforward to
see that the expression for the future cost is *exactly* the
optimal value *v*^{⋆}(*s*′) of
starting the dynamics at *s*′ = *T*(*s*,*a*).
This way, the principle of optimality expresses itself mathematically as a
recursive equation that the value of an optimal policy must
satisfy.

$$ \begin{array}{rl} v^\star(s) = \min\limits_{a} & c(s, a) + \gamma v^\star(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} $$

This is called the *Bellman equation* and all dynamic
programming consists of methods for solving it. Even more: we can think
of the Bellman equation as a recursive specification for the decision
problems and of dynamic programming as any problem-specific
implementation that solves it.
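This recursive-specification view is exactly where the memoization folklore comes from. As a sketch, here is the Bellman equation solved by a memoized recursion on a made-up finite-horizon problem: walking from step 0 to step *N*, jumping 1 or 2 steps at a time and paying a cost at each landing (all data is illustrative).

```python
from functools import lru_cache

# Hypothetical finite-horizon problem: walk from step 0 to step N, jumping
# 1 or 2 steps at a time; step_cost is made-up data for illustration.
N = 6
step_cost = [3, 1, 4, 1, 5, 9, 0]  # cost paid for landing on each step

def actions(s):
    """A(s): jumps that do not overshoot the final step N."""
    return [a for a in (1, 2) if s + a <= N]

@lru_cache(maxsize=None)
def v(s):
    """The Bellman equation solved as a memoized recursion (γ = 1)."""
    if s == N:
        return 0  # terminal state: nothing left to pay
    return min(step_cost[s + a] + v(s + a) for a in actions(s))
```

The `lru_cache` decorator is the memoization: each state is solved once and reused, which is precisely the "trick" people associate with dynamic programming.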

It is time to get deeper into analysis. Whenever mathematicians see a
recursive relation such as the Bellman equation, they immediately start
asking questions like: what guarantees do we have about *v*^{⋆}? Can I trust that it
is unique? Does it even exist? Surely we mathematicians may seem a bit
too anxious with all these questions, but for good reason.
Besides the guarantee that everything works, proving the existence of a
solution in many cases also teaches us how to construct this solution.
In fact, this will be just the case! In the next section we are going to
adapt the theorems in here into algorithms to solve the Bellman
equation.

Recursion has a deep relationship with fixed points. For example, we
can recast the Bellman equation as the fixed point equation of an operator ℬ : (𝒮→ℝ) → (𝒮→ℝ) called, as you can guess, the
*Bellman Operator*. It takes value functions to value functions
and is defined as

$$ \begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} $$
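Concretely, for a finite problem the operator is just a function from value-function tables to value-function tables. Below is a minimal Python sketch on a made-up two-state problem; the names `gamma`, `States`, `actions`, `T` and `c` are toy placeholders, not from any library:

```python
# Toy two-state problem: state 0 is cheap to stay in, state 1 is
# expensive, and "move" swaps states for a cost of 1. All made up.
gamma = 0.9
States = [0, 1]

def actions(s):
    return ["stay", "move"]

def T(s, a):
    return s if a == "stay" else 1 - s

def c(s, a):
    if a == "move":
        return 1.0
    return 0.0 if s == 0 else 2.0

def bellman(v):
    """(Bv)(s) = min over a in A(s) of c(s, a) + γ·v(T(s, a))."""
    return {s: min(c(s, a) + gamma * v[T(s, a)] for a in actions(s))
            for s in States}
```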

If you are not accustomed to fixed points, the move above from the Bellman equation to an operator may seem strange. Hence, let’s think about a little story in order to develop some intuition.

Imagine that you are the king/queen of a fantastical kingdom. You are an absolute monarch and whatever action you decide to take, your subjects will follow. Lately, the kingdom’s reserves are running dry and your counselors advised you to define a clear governing policy in order to minimize the kingdom’s spending. Since besides a ruthless ruler you’re also a great mathematician and a fan of my blog, at this point you already know what must be done to save your kingdom from bankruptcy: solve the Bellman equation in order to find its optimal policy.

Because at this point of the post you don’t yet know how to solve it, you decide to take advantage of your fantastical world and hire a wizard to look into his crystal ball and act as an oracle, telling you how much each possible state will cost the kingdom in the future. Beware, though, that it is not wise to blindly follow advice from a crystal ball. The right thing to do in this situation is to use this oracle to decide what to do at each state. Now, instead of solving the Bellman equation, all you have to do to get a policy is to solve the optimization problem with future cost given by the wizard’s predictions. The process of going from prediction to decision is precisely the Bellman operator. The function ℬ*v* is the value function of the policy that chooses the best action according to the prediction *v*.

The optimal value function, furthermore, is the one with the correct prediction, and thus is kept unchanged by applying the Bellman operator:

ℬ*v*^{⋆} = *v*^{⋆}.

We thus reduce the question of existence and uniqueness of solutions
for the Bellman equation to finding fixed points of ℬ. Fortunately there is a theorem that does
just that, called the *Banach Fixed Point* theorem!

Let (*M*,d) be a complete
metric space and *f* : *M* → *M* a
continuous function such that there is a *γ* ∈ (0,1) satisfying

d(*f*(*x*),*f*(*y*)) ≤ *γ*d(*x*,*y*)

for any *x*, *y* ∈ *M*. Then
*f* has a unique fixed point
*x*^{⋆} and for any
initial value *x*_{0},
iterating *f* will eventually
converge towards *x*^{⋆},

lim_{n → ∞}*f*^{n}(*x*_{0}) = *x*^{⋆}.

Proving this theorem is out of scope for this post^{3}.
However, we can think of it as saying that if a mapping shrinks all
distances, then eventually the image of all points will be squeezed into
a single point.
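To see the theorem in action, take the map *f*(*x*) = 0.5*x* + 1 on the real line: it shrinks all distances by *γ* = 0.5, so iterating it from any starting point squeezes everything into its unique fixed point *x*^{⋆} = 2. A quick Python check:

```python
# f(x) = 0.5*x + 1 is a contraction on the reals with γ = 0.5; its unique
# fixed point satisfies x = 0.5*x + 1, i.e. x* = 2.
def f(x):
    return 0.5 * x + 1

x = 100.0  # any starting point works
for _ in range(60):
    x = f(x)
# the error shrinks by half each step, so x is now essentially 2
```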

To apply the Banach fixed point theorem to the Bellman operator, we
need to find a metric space where ℬ is
a contraction. The suitable space is that of bounded continuous functions
over the states, *C*_{b}^{0}(𝒮,ℝ),
imbued with the uniform norm

∥*f*∥_{∞} = sup_{s ∈ 𝒮}|*f*(*s*)|.

This is a complete metric space and is general enough for any
practical problem we may think of. Furthermore, recall that if the state
space is finite, then *any function is continuous and bounded*,
hinting that this encompasses the most important spaces from a
computational perspective.

Since I don’t want to depart too much from this post’s main topic nor dive into mathematical minutiae, this section only states the important theorem. In case you are interested, the necessary proofs are in an appendix for completeness.

The Bellman operator is a continuous contraction whenever the
discount factor satisfies *γ* < 1.
Thus it has a unique fixed point and, for any initial value function *v*_{0},

ℬ^{(n)}*v*_{0} → *v*^{⋆}.

Moreover, if *v*_{n} is the *n*-th iterate, we can estimate its
distance to the optimum by

$$ \mathrm{d}(v_n, v^\star) \le \frac{\gamma^n}{1 - \gamma} \mathrm{d}(v_0, \mathcal{B}v_0).$$

Besides the existence and uniqueness guarantees, this theorem also has a more practical character. No matter which cost estimate we start with, we can use the Bellman operator to update our value function until it converges towards the optimal one. This is the next section’s topic.

For this section, let’s assume that both the state 𝒮 and action 𝒜(*s*) spaces are finite. This will
allow us to focus on exhaustive methods exploring the entire state
space. Keep calm however, I am planning to write other posts in the
future to explain how these ideas generalize to continuous spaces, or
even to finite spaces that are too huge to explore entirely, via
something called Approximate Dynamic Programming or Reinforcement
Learning. But this is a story for another night…

From the previous discussion, we learned that iterating the Bellman
operator converges towards the optimal value function.
Thus, we arrive at our first algorithm: *value iteration*. Its
main idea is actually quite simple: to convert the Bellman equation into
an update rule.

*v* ← ℬ*v*.

We can thus start with an initial value function *v*_{0} and iterate the update
rule above. By the magic of the Banach Fixed Point theorem, this will
converge towards the optimal value function no matter what the initial
value function is. This procedure repeats until the uniform error ∥*v* − ℬ*v*∥_{∞}
becomes less than a previously set tolerance (for which we have an
estimate of necessary iterations).

For efficiency reasons, it is customary to represent the value function not as a function but using some data structure (usually a vector). The memoization that people associate with dynamic programming lies entirely in this “trick”. However, it is good to keep in mind that this is only a matter of computational representation of functions and is totally orthogonal to any algorithm’s design. In a language with first-class functions, it is possible to do dynamic programming using only function composition. It is just possible that it will not be as fast as one would like.

In value iteration, we obtain an optimal policy from the value function by keeping track of the argmin whenever we solve an optimization problem. Below we see a Julia implementation of value iteration.

```
function value_iteration(v0; tol)
    v = copy(v0)
    π = Policy{States, Actions}()
    maxerr = Inf
    while maxerr > tol
        maxerr = 0
        for s in States
            prev = v[s]
            v[s], π[s] = findmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
            maxerr = max(maxerr, abs(v[s] - prev))
        end
    end
    return π, v
end
```
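The Julia snippet above assumes `States`, `Actions`, `c`, `T` and `γ` are defined globally. As a self-contained sanity check, here is a Python translation running on a made-up two-state problem (all names are illustrative placeholders):

```python
import math

# Toy problem data: state 0 is cheap to stay in, state 1 is expensive,
# and "move" swaps states for a cost of 1. All made up for illustration.
gamma = 0.9
States = [0, 1]

def actions(s):
    return ["stay", "move"]

def T(s, a):
    return s if a == "stay" else 1 - s

def c(s, a):
    if a == "move":
        return 1.0
    return 0.0 if s == 0 else 2.0

def value_iteration(v0, tol=1e-8):
    v, policy = dict(v0), {}
    maxerr = math.inf
    while maxerr > tol:
        maxerr = 0.0
        for s in States:  # in-place sweep, as in the Julia version
            prev = v[s]
            v[s], policy[s] = min(
                (c(s, a) + gamma * v[T(s, a)], a) for a in actions(s))
            maxerr = max(maxerr, abs(v[s] - prev))
    return policy, v
```

On this toy problem the optimal policy is to stay at state 0 and pay 1 to move out of state 1, giving *v*^{⋆} = (0, 1).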

In the animation below, we can see value iteration in action for the problem of escaping from a maze. In this model, each state is a cell in the grid and the actions are the directions one can take at that cell (neighbours without a wall). The objective is to reach the bottom-right corner in the minimum number of steps possible. We do so by starting with a uniformly zero value and a random policy. On the left we see the value function at each iteration and on the right the associated policy.

The algorithm above is in fact just one variation of value iteration.
There are still many problem-dependent improvements one can make. For
example, we chose to update *v*
in-place, already propagating the new value function while traversing
the states, but we could instead have kept the old value function and
only updated *v* after
traversing all states. Our approach has the advantage of using the
improved information as soon as it is available but updating in batch
may be interesting when we’re able to broadcast the optimization across
many processes in parallel.

Another important choice we have is the initial value function.
Choosing a good warm start can greatly improve convergence. As an
example, whenever there is a terminal state ◼, it is a good idea to already fill *v*(◼) = 0. Finally, the order in which we
traverse the states matters. There is a reason why dynamic programming is
famous for solving problems backwards. If we know that a given state is
easier to solve, we should start the traversal from it. This is a specialization
that we will further explore in a bit.

Well, we have a lot of options… Nevertheless, as long as we keep visiting all states, any of those approaches is guaranteed to converge towards the optimal value function; which one is faster is generally problem-dependent. This is why I think it is best to think of DP not as an algorithm but as a principle that encompasses many similar algorithms.

Let’s take look at a typical scenario where we can exploit the state
space structure to make value iteration much faster: problems with a
*fixed finite horizon*.

When the dynamics ends at a certain number *N* of time steps, we say that the
problem has a finite horizon. In this case, the stage *t* is part of the state variable
(because how else would we know when to stop?) and there is a
*terminal state* ◼ with zero
cost, representing anything that happens after the dynamics end.
Essentially, we work for *N*
stages and then reach ◼, where we can
just relax and do nothing for the rest of eternity.

```
digraph "State over Time" {
rankdir=LR;
size="8,5"
T [label = "" width=0.2 style=filled, color = black, fillcolor = black, shape = square];
node [shape = circle
style = "solid,filled"
width = 0.7
color = black
fixedsize = shape
fillcolor = "#B3FFB3"
label = ""];
subgraph cluster_1 {
rank = same;
label="t = 1";
s;
}
subgraph cluster_2 {
rank = same;
label="t = 2";
a2; b2;
}
subgraph cluster_3 {
rank = same;
label="t = 3";
a3; b3;
}
subgraph cluster_N {
rank = same;
label="t = N";
a4; b4;
}
subgraph cluster_ldots {
rank = same;
style = invis;
k [fontsize=20 color = "#00000000" fillcolor= "#00000000" label = ". . ."];
}
s -> {a2 b2} -> {a3 b3} -> k -> {a4 b4} -> T;
T:e -> T:e [constraint=false];
}
```

If you prefer equations over figures, a finite horizon state machine is one where the transition function looks like this:

$$ \bar{T}((t, s), a) = \begin{cases} (t + 1, T(s, a)), & t < N \\ \blacksquare, & t = N. \end{cases} $$
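In code, this wrapping of the transition function is a one-liner. The sketch below is a hypothetical Python rendition; `TERMINAL` and the toy dynamics are illustrative:

```python
TERMINAL = "terminal"  # the absorbing terminal state, with zero cost

def make_finite_horizon(T, N):
    """Wrap T so the stage t becomes part of the state: (t, s)."""
    def T_bar(state, a):
        t, s = state  # note: the terminal state itself has no actions
        return (t + 1, T(s, a)) if t < N else TERMINAL
    return T_bar

# Toy dynamics for illustration: the action is simply added to the state.
T_bar = make_finite_horizon(lambda s, a: s + a, N=2)
```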

But what is so special about finite horizons? After all, the equation
above seems much more confusing than what we had before. Well… what we
gain is that the state space 𝒮 is
clustered into smaller spaces that are visited in sequence: 𝒮_{1}, 𝒮_{2}, …, 𝒮_{N}, 𝒮_{N + 1} = {◼}.

The *backward induction* algorithm consists of value iteration
but exploiting the structure we just discussed. At the terminal state,
the cost is always zero, so we can set *v*(◼) = 0. But this means that the
*future cost* for all states in 𝒮_{N} is fixed, and the
optimal policy for them is just to choose the cheapest action. Now that
we know which action to choose in 𝒮_{N}, the future is fixed
for all states in 𝒮_{N − 1}, reducing the
problem in them to the routine we just did. We repeat this procedure
until we reach the first stage. This calculates an optimal policy by
induction, but moving “backwards” in time.

```
digraph "Backward Induction" {
size = "8,5";
center = true;
nodesep = 0.5;
{ rank=sink;
node [label = "", width=0.2 style=filled, color = black, fillcolor = black, shape = square, margin=1.5];
T [group = infty];
};
{ rank=source;
node [fontsize=20 color = "#00000000" fillcolor= "#00000000" label = ". . ."]
k [group = infty];
};
node [shape = circle
style = "solid,filled"
width = 0.7
color = black
fixedsize = shape
fillcolor = "#B3FFB3"
label = ""];
subgraph last_stage {
rank = same;
node [group = me];
A; B; Z [width = 0.1,group = infty, style=invis]; C; D;
}
// Alignment edges
A -> B -> Z -> C -> D [style=invis];
T -> Z -> k [style=invis];
// Real edges
k -> {A B C D};
edge[minlen=2];
{D} -> T;
{A B C D} -> T;
{D B A C} -> T [color = red style = bold];
{B} -> T;
}
```

In Julia code, the algorithm looks like this:

```
function backward_induction()
    v = Values{States}()
    π = Policy{States, Actions}()
    for t in N:-1:1
        for s in States(t)
            v[s], π[s] = findmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
        end
    end
    return π, v
end
```

I recommend comparing this with the generic value iteration
to see what we gain. One thing should be clear: backward induction
does the exact same operation as value iteration for each state, but
only requires a single pass over the state space. This makes it much
more efficient. Another, more subtle, detail is that since we have more
structure to exploit in the dynamics, the algorithm doesn’t have to
assume as much about the Bellman operator. For example, even though it was
hand-wavy, we just gave a proof of convergence above that has no
dependence on the Banach fixed-point theorem. Thus, as a bonus, backward
induction works for any discount factor *γ*.^{4}

One issue with value iteration is that all policy calculations are implicit, since we just work with value functions. It is then possible that we reach an optimal policy and keep iterating the algorithm to improve the estimate for the value function. In this section, let’s see how we can directly calculate an optimal policy in a finite number of steps.

Our next question is then how to calculate the cost associated with a
policy. Let’s say somebody gave you a policy *π* and told you nothing more about
it. How can you find its value function *v*^{π}? One way is
to notice that it satisfies a recursion similar to the Bellman equation,
but without the minimization step.

*v*^{π}(*s*) = *c*(*s*,*π*(*s*)) + *γ**v*^{π}(*T*(*s*,*π*(*s*))).

If you follow the same steps we did before transforming this equation
into a fixed point problem, you will see that under the same assumptions
of continuity (always valid for finite state and action spaces) it also
has a unique solution. Moreover, turning it into an update procedure
converges towards *v*^{π} for any
initial value function. Thus, we arrive at an algorithm for evaluating
the cost of a policy, unimaginatively called *policy
evaluation*.

```
function policy_evaluation(π, v0=zeros(States); tol)
    v = copy(v0)
    maxerr = Inf
    while maxerr > tol
        maxerr = 0
        for s in States
            prev = v[s]
            v[s] = c(s, π[s]) + γ*v[T(s, π[s])]
            maxerr = max(maxerr, abs(v[s] - prev))
        end
    end
    return v
end
```

Notice the similarity with value iteration. The only difference is that we are iterating a single evaluation, not an entire optimization problem.

After we know a policy and its value function, our next question is how to improve it. That is, how can we use this information to get nearer to an optimal policy?

The secret lies in locally improving our policy for each state.
Consider a state *s*. The value
*v*^{π}(*s*)
is the total cost of starting at *s* and following *π* thereafter. What if there is an
*a* ∈ 𝒜(*s*) such that
choosing *a* at the first step
and following *π* thereafter is
*cheaper* than just following *π*?

*c*(*s*,*a*) + *γ**v*^{π}(*T*(*s*,*a*)) < *v*^{π}(*s*).

Since we have this improvement at the first step and our processes
are assumed to not depend on anything besides what is represented on the
state *s*, then it must be
better to choose *a* than *π*(*s*) whenever we are at
state *s*. That is, if we define
a new policy

$$ \lambda(x) = \begin{cases} a, & x = s \\ \pi(x), & \text{otherwise} \end{cases} $$

it is always better to follow *λ* than *π*, because *v*^{λ} ≤ *v*^{π}.

We thus arrive at our next algorithm: *policy iteration*. It
consists of taking a policy *π*,
finding its value function *v*^{π} through
policy evaluation and finally using policy improvement to arrive at a
better policy. Since there are only finitely many policies, and we
always obtain a strictly better policy, this algorithm is guaranteed to
converge to an optimal policy in a finite amount of steps.

```
function policy_improvement(π0, v)
    π = copy(π0)
    for s in States
        π[s] = argmin(a -> c(s, a) + γ*v[T(s,a)], Actions(s))
    end
    return π
end

function policy_iteration(π, v=zeros(States); tol)
    while true
        π0 = π
        v = policy_evaluation(π, v; tol=tol)
        π = policy_improvement(π, v)
        if π == π0 break end
    end
    return π, v
end
```
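As with value iteration, we can sanity-check policy iteration on a tiny made-up problem. The sketch below is a Python translation of the algorithm above, with toy placeholders for the problem data:

```python
# Toy two-state problem; all names are illustrative placeholders for the
# globals (States, Actions, c, T, γ) the Julia snippets assume.
gamma = 0.9
States = [0, 1]

def actions(s):
    return ["stay", "move"]

def T(s, a):
    return s if a == "stay" else 1 - s

def c(s, a):
    if a == "move":
        return 1.0
    return 0.0 if s == 0 else 2.0

def policy_evaluation(policy, v, tol=1e-10):
    """Iterate the policy's recursion until the value settles."""
    maxerr = float("inf")
    while maxerr > tol:
        maxerr = 0.0
        for s in States:
            prev = v[s]
            v[s] = c(s, policy[s]) + gamma * v[T(s, policy[s])]
            maxerr = max(maxerr, abs(v[s] - prev))
    return v

def policy_improvement(v):
    """Greedily pick the best action against the current value estimate."""
    return {s: min(actions(s), key=lambda a: c(s, a) + gamma * v[T(s, a)])
            for s in States}

def policy_iteration(policy):
    v = {s: 0.0 for s in States}
    while True:  # evaluation and improvement alternate until stable
        v = policy_evaluation(policy, v)
        new = policy_improvement(v)
        if new == policy:
            return policy, v
        policy = new
```

Starting from the all-"stay" policy, a single improvement step already finds the optimal policy here: stay at state 0 and pay 1 to move out of state 1.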

Did you notice that we iteratively alternate between two kinds of passes over the states? This is the mother of all backwards-forwards algorithms people generally associate with dynamic programming.

Just like with value iteration, there’s also a lot of freedom in how we traverse the states. Again it is useful to think of policy iteration more as a principle than as an algorithm in itself and adapt the steps to consider any problem specific information that may be available.

Until now, we’ve only dealt with deterministic processes. Life, on the other hand, is full of uncertainty and, as a good applied field, dynamic programming was created from its inception to deal with stochastic settings.

We call a state machine where the transitions *T*(*s*,*a*) and costs
*c*(*s*,*a*) are
stochastic a *Markov Decision Process* (MDP for short). This name
comes from the fact that the new state only depends on the current state
and action, being independent of the process’ history just like a Markov
chain. A usual intuition for this kind of process is as the
interaction between an actor and an environment. At each time step, the
environment is at a state *s*
and the actor may choose among different actions *a* ∈ 𝒜(*s*) to interact with
the environment. This action affects the environment in some way that is
out of reach to the actor (thus stochastic / non-deterministic),
changing its state to *s*′ = *T*(*s*,*a*)
and incurring a cost of *c*(*s*,*a*) to the
actor, as illustrated in the diagram below.

```
digraph {
rankdir=LR;
compound=true;
{rank = source; A [label = "Actor"]};
subgraph cluster_env{
rank = same;
label="Environment";
node [shape = circle
style = "solid,filled"
color = black
fixedsize = shape
fillcolor = "#B3FFB3"];
s2 [label = "s'"];
s -> s2 [label = "T(s,a)"];
}
A -> s:nw [label = "a" lhead=cluster_env];
s:s -> A [label = "c(s, a)" ltail=cluster_env];
}
```

Allowing non-determinism opens the way for modeling a lot more cool situations. For example, robots that play video games! The states may be the internal state of the game, or some partial observation of it available to the player, and the actions are the buttons on the controller. The transitions are internal to the game and the costs are related to some winning/losing factor. Have you ever heard of DeepMind’s Playing Atari with Deep Reinforcement Learning paper? In it, they use reinforcement learning to train a robot capable of playing Atari 2600 games, and all the modeling is done via Markov decision processes in a way that is really similar to the discussion in this paragraph. I really recommend checking it out.

With a stochastic environment, we can’t necessarily predict the costs and transitions, only estimate them. To accommodate that, we will have to change a few things in our definition of policies and value functions. Those will specialize to our old definitions whenever the MDP is deterministic.

A deterministic policy was a choice of action for each state. We
define a *stochastic policy* as a random function choosing an
action given a state. For the value function, we still want to assign a
total cost to each state. A solution is to consider all possible costs
and take their average. The value of a stochastic policy is thus

*v*^{π}(*s*) = 𝔼^{π}[*c*(*s*,*a*)+*γ**v*^{π}(*T*(*s*,*a*))|*a* = *π*(*s*)]

where we consider the action *a* to be randomly sampled
according to the policy and write 𝔼^{π} for the average value
considering that we follow it. Similarly, the Bellman equation for the
optimal value function also considers the mean cost:

$$ \begin{array}{rl} v^\star(s) = \min\limits_{a} & \mathbb{E}\left[c(s, a) + \gamma v^\star(s')\right] \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s). \end{array} $$

In this post we will only work with the expected value, but know that if you were feeling particularly risk-averse, you could exchange it for any coherent risk measure and all results would still hold.

If the (possibly random) states, actions and costs are finite, the
methods of *value iteration* and *policy iteration* also
hold with minor adjustments. One can prove the existence of solutions to
the stochastic Bellman equation, and the convergence of these methods,
under the same assumptions as before.

The main difference from the deterministic case is that we need to
evaluate the expectation during the process. Let’s call *p*(*s*′,*c*|*s*,*a*)
the probability of getting a transition *s*′ = *T*(*s*,*a*)
and a cost *c*(*s*,*a*) given that
we choose action *a* at state
*s*. With this, the expectation
becomes a weighted sum and we can do *value iteration* exactly as
before, with the only difference that our update rule becomes

$$ v(s) \gets \min\limits_{a} \sum_{s',\,c} p(s', c | s, a)\left(c + \gamma v(s')\right). $$

Similarly, the update rule for policy evaluation is

*v*(*s*) ← ∑_{s′, c}*p*(*s*′,*c*|*s*,*π*(*s*))(*c*+*γ**v*(*s*′))

and the one for policy improvement is

$$ \pi(s) \gets \argmin\limits_{a} \sum_{s',\,c} p(s', c | s, a)\left(c + \gamma v(s')\right). $$

The algorithm for policy iteration is also exactly the same as the deterministic one except for these two lines of code.
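As a sketch of how these update rules translate into code, the following Python fragment runs stochastic value iteration on a hypothetical two-state problem, with the dictionary `p` enumerating the probabilities *p*(*s*′, *c* | *s*, *a*). All numbers are invented for illustration.

```python
GAMMA = 0.9
# p[(s, a)] lists (probability, next_state, cost) triples: p(s', c | s, a).
p = {
    (0, "risky"): [(0.5, 1, 0.0), (0.5, 0, 4.0)],
    (0, "safe"):  [(1.0, 1, 3.0)],
    (1, "stay"):  [(1.0, 1, 0.0)],
}
ACTIONS = {0: ["risky", "safe"], 1: ["stay"]}
STATES = [0, 1]

def expected_cost(v, s, a):
    # The expectation becomes a weighted sum over the enumerated outcomes.
    return sum(q * (c + GAMMA * v[s2]) for q, s2, c in p[(s, a)])

def value_iteration(tol=1e-10):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = min(expected_cost(v, s, a) for a in ACTIONS[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

v = value_iteration()
```

Here the "risky" action at state 0 has expected cost 2 + 0.45·v(0), so the iteration eventually settles on the "safe" action with value 3.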

Well, we finally reached the end of our overview of dynamic
programming. I hope it was as fun to read as it was for me to write. And
that DP gets the place of honor it deserves in your problem-solving
toolbox.^{5} Of course, a single blog post is too
tiny to encompass a subject as vast as DP. So let’s have a little
appetizer of what is out there.

One topic we barely scratched the surface of is continuous state problems, even though they are quite important. And I say this last part seriously, as someone who works with them daily. In our theoretical discussions, we never really used the finiteness of the states; it only mattered when we built the algorithms with vectors and loops. This means that, at a mathematical level, both value and policy iteration work for infinite states if you keep the continuity assumptions in mind.

Therefore, if one has a way to traverse the entire state space 𝒮, the algorithms will converge and we may
even get an analytic solution to the Bellman Equation. In practice this
is hard and we have to resort to some kind of *Approximate Dynamic
Programming*. The idea is to approximate the value function *v* with some representation requiring
only a finite amount of information and sample the state space in order
to fit it with the data.

The best way to do that is problem-dependent, of course. For example, if the optimization in the Bellman equation is a linear program,

$$ \begin{array}{rl} v(s) = \min\limits_{a} & c^ta + \gamma v(s') \\ \textrm{s.t.} & s' = Ms + Na, \\ & a \ge 0, \end{array} $$

one can guarantee that *v*
must be convex and piecewise-linear. Thus, at each iteration of our
algorithm we can solve for state *s*^{(i)} and use
linear programming duality to get an affine subapproximation to *v* (called a cutting plane),

$$ v(s) \ge v^{(i)} + \left\langleλ^{(i)}, s - s^{(i)}\right\rangle, \forall s \in \mathcal{S}.$$

Hence instead of solving the recursive equation, we sample the state
space and use this cut approximation of *v*:

$$ \begin{array}{rl} v(s) = \min\limits_{a} & c^ta + \gamma z \\ \textrm{s.t.} & s' = Ms + Na, \\ & a \ge 0, \\ & z \ge v^{(i)} + \left\langle λ^{(i)}, s' - s^{(i)} \right\rangle, \forall i. \end{array} $$

This procedure yields piecewise-linear approximations that eventually
converge to the real optimal value function. The idea we just discussed
forms the basis of *Dual Dynamic Programming*, a method largely
employed by people in Stochastic Programming and Operations
Research.

Another thing we can do when the problem’s structure is not as nice
as a linear program is to use some kind of universal approximator for
*v*, such as a neural network.
This is called *Neural Dynamic Programming*, and the work of Dimitri Bertsekas is
a great place to start learning about it.

Besides the algorithms, there are other classes of problems that
admit a Bellman equation but didn’t fit our narrative despite their
importance. One such example is game problems, where we do not have
total control over the actions taken. Unlike MDPs, where we
know the transition probabilities for all states, in this case there are
also other players who are supposedly trying to *maximize* your
cost. As an example, think of a game of Poker where the other players
want to maximize your losses (thus minimize their own). In this case,
besides your actions *a*, there
are also actions *α* from the
other players who affect the state transition *s*′ = *T*(*s*,*a*,*α*)
providing a Bellman equation

$$ \begin{array}{rl} v(s) = \min\limits_{a} \max\limits_{\alpha} &\mathbb{E}[c(s, a) + \gamma v(s')] \\ \textrm{s.t.} & s' = T(s, a, \alpha), \\ & a \in \mathcal{A}_\mathrm{you}(s) \\ & \alpha \in \mathcal{A}_\mathrm{enemy}(s). \end{array} $$
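Value iteration adapts directly to this min-max equation: each sweep takes a minimum over our actions of a maximum over the opponent's. A Python sketch on a hypothetical two-player game with deterministic transitions (all data invented for illustration):

```python
GAMMA = 0.9
STATES = [0, 1]
# T[s][a][alpha] is the next state; C[s][a] is our cost for playing a at s.
T = {0: {"l": {"l": 0, "r": 1}, "r": {"l": 1, "r": 0}},
     1: {"l": {"l": 1, "r": 1}, "r": {"l": 1, "r": 1}}}
C = {0: {"l": 1.0, "r": 2.0}, 1: {"l": 0.0, "r": 0.0}}

def minimax_update(v):
    # One sweep of the min-max Bellman equation: we minimize while the
    # opponent picks the worst alpha for us.
    return {s: min(max(C[s][a] + GAMMA * v[T[s][a][alpha]]
                       for alpha in T[s][a])
                   for a in T[s])
            for s in STATES}

v = {0: 0.0, 1: 0.0}
for _ in range(500):
    v = minimax_update(v)
```

In this toy game, state 1 is a free absorbing state, while at state 0 the adversary always forces us back to state 0, so the value there solves v = 1 + γv.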

Another important instance that I didn’t mention before is control
problems in continuous time. This is a huge area of engineering and
applied math, and much of Bellman’s original work was focused on solving
this kind of problem. As an example, think of an airplane flying. Its
dynamics are given by a differential equation taking into account both its
current state *s* and how the
pilot controls the plane *a*,

$$ \frac{ds}{dt} = T(s, a). $$

Think of *s*_{t} as the
plane’s position and direction, while *a*_{t} represents
how much the pilot chooses to accelerate or turn the plane’s yoke at
time *t*. The cost may be the
fuel consumption. As in any good physics problem, time here is continuous
and the sum over time steps becomes an integral

$$ \begin{array}{rl} v(s) = \min\limits_{a} & \int_{0}^{\infty} c(s(t), a(t))dt \\ \textrm{s.t.} & \frac{ds}{dt} = T(s(t), a(t)), \\ & s(0) = s, \\ & a \in \mathcal{A}(s). \\ \end{array} $$

By following similar steps to our deduction of the discrete time
Bellman equation plus a lot of hard analysis^{6},
one arrives at a non-linear partial differential equation called the
*Hamilton-Jacobi-Bellman equation*:

0 = min_{a}*c*(*s*,*a*) + ⟨∇*v*(*s*),*T*(*s*,*a*)⟩.

This equation analogously describes the optimal value of a state recursively, but now it is much harder to solve. The methods employed generally involve discretization combined with our by-now old friends, value and policy iteration.

After this final overview, it’s indeed time to end our journey, nevertheless knowing that there is much more to explore out there. I will try to post more about dynamic programming and specially its connections to Stochastic Optimization and Reinforcement Learning but I don’t really know when I will have the time.

Farewell and see you next time!

In this appendix we show that the Bellman Operator

$$ \begin{array}{rl} (\mathcal{B}v)(s) = \min\limits_{a} & c(s, a) + \gamma v(s') \\ \textrm{s.t.} & s' = T(s, a), \\ & a \in \mathcal{A}(s) \end{array} $$

satisfies all the requirements to be a contraction over the space of bounded continuous functions.

To prove that ℬ is a contraction, we
will start with another of its properties: *monotonicity*.
Besides this proof, it will also be useful in the future when we talk
about policy improvement.

Suppose that we have two value functions *v* and *w* such that *v* estimates costs uniformly lower
than *w*:

*v*(*s*) ≤ *w*(*s*), ∀*s* ∈ 𝒮.

Then the Bellman operator preserves this relationship: (ℬ*v*)(*s*) ≤ (ℬ*w*)(*s*), ∀*s* ∈ 𝒮.

Given a state *s*, we get
from *v* ≤ *w* that the
following inequality holds for any action *a* ∈ 𝒜(*s*):

*c*(*s*,*a*) + *γ**v*(*T*(*s*,*a*)) ≤ *c*(*s*,*a*) + *γ**w*(*T*(*s*,*a*)).

Since this is valid for any *a*, taking the minimum on both sides
preserves the inequality. Thus

$$ \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma v(T(s,a)) \le \min_{a \in \mathcal{A}(s)} c(s, a) + \gamma w(T(s,a)) \\ (\mathcal{B}v)(s) \le (\mathcal{B}w)(s). $$

Finally, let’s prove that the Bellman operator contracts the space of
bounded continuous functions by the discount factor. What we need to
show is that for any value function *v*, *w*:

∥ℬ*v* − ℬ*w*∥_{∞} ≤ *γ*∥*v* − *w*∥_{∞}.

From the definition of the uniform norm, we get that for any state
*s*,

$$ v(s) - w(s) \le \|v - w\|_\infty \\ v(s) \le w(s) + \|v - w\|_\infty. $$

From the monotonicity we just proved, applying ℬ to both sides preserves this inequality:

(ℬ*v*)(*s*) ≤ ℬ(*w*+∥*v*−*w*∥_{∞})(*s*).

Let’s show that the constant ∥*v* − *w*∥_{∞} only
shifts the function ℬ*w*
uniformly by another constant:

$$ \begin{array}{rlll} \mathcal{B}(w + \|v - w\|_\infty)(s) &= &\min\limits_{a} & c(s, a) + \gamma (w(s') + \|v - w\|_\infty) \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s) \\ &= &\min\limits_{a} & c(s, a) + \gamma w(s') + \gamma \|v - w\|_\infty \\ &&\textrm{s.t.} & s' = T(s, a), \\ && & a \in \mathcal{A}(s). \end{array} $$

This proves that

$$ \begin{aligned} (\mathcal{B}v)(s) &\le (\mathcal{B}w)(s) + \gamma \|v - w\|_\infty \\ (\mathcal{B}v)(s) - (\mathcal{B}w)(s) &\le \gamma \|v - w\|_\infty. \end{aligned} $$

By doing the same derivation in the opposite direction (for *w* − *v*) we get an inequality
for the absolute value. Applying the supremum, it becomes the result we
want.

$$ \begin{aligned} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \sup_{s\in\mathcal{S}} |(\mathcal{B}v)(s) - (\mathcal{B}w)(s)| &\le \gamma \|v - w\|_\infty \\ \|\mathcal{B}v - \mathcal{B}w\|_\infty &\le \gamma \|v - w\|_\infty. \end{aligned} $$

Since ℬ is a contraction, the Banach
fixed point theorem guarantees to us that there exists a unique value
function *v*^{⋆}
satisfying the Bellman equation. Furthermore, the theorem also points us
towards a way to solve this kind of equation with polynomial complexity
on the size of the state and action spaces. This is the topic we’re
going to investigate next.
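For the skeptical reader, the contraction property is also easy to check numerically. This Python sketch builds a random finite problem (made-up data) and verifies that the Bellman operator shrinks sup-norm distances by at least the factor *γ*:

```python
import random

random.seed(0)
GAMMA = 0.8
N_STATES, N_ACTIONS = 5, 3
# Random deterministic transitions and costs for a hypothetical problem.
T = [[random.randrange(N_STATES) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
C = [[random.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def bellman(v):
    # (Bv)(s) = min_a c(s, a) + gamma * v(T(s, a))
    return [min(C[s][a] + GAMMA * v[T[s][a]] for a in range(N_ACTIONS))
            for s in range(N_STATES)]

def sup_dist(v, w):
    return max(abs(x - y) for x, y in zip(v, w))

v = [random.uniform(-10, 10) for _ in range(N_STATES)]
w = [random.uniform(-10, 10) for _ in range(N_STATES)]
# ||Bv - Bw|| <= gamma * ||v - w||, exactly as proved above.
assert sup_dist(bellman(v), bellman(w)) <= GAMMA * sup_dist(v, w) + 1e-12
```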

This post came to life after a series of conversations I had with Pedro Xavier. The good thing about explaining something to a smart person is that you end up learning a lot in the process. Sometimes you even learn enough to write a blog post about it.

I’m also indebted to Ivani Ivanova for being such a great typo hunter. If there is any typo left, it is because I’m lazy… She did a marvelous job.

To be more precise, we work with hydrothermal dispatch problems, where one must decide between many sources of energy (hydro, thermal, renewable) to supply a certain power demand taking into account the uncertainties of the future. For example: hydro is cheap and clean, but you risk running out of water if you use all of it and next month turns out particularly dry. Finding the best energy dispatch is once again solved via dynamic programming.↩︎

Even Richard Bellman admittedly named it based on how cool it sounds.↩︎

It is out of scope more because it is a tangent to the topic than because of any difficulty in the proof. If you are interested in analysis, I really recommend you to try proving it. The proof consists of using the contraction property to show that the distance between the iterates of *f* must converge towards zero.↩︎

In contrast with value iteration, it also has no dependence on the costs being real numbers. In this case any ordered ring (such as ℕ or ℤ) would do. But I digress… This is out of scope for this post.↩︎

In fact, fixed points and recursion deserve the spot. They’re everywhere!↩︎

The derivation mostly consists of Taylor expansions to be sincere. The tricky part is justifying your steps.↩︎

By the way, in case this is the first time you hear about it: memoization is a programming technique where instead of letting a function calculate the same value whenever it is called, we instead store the already calculated value somewhere and just do a lookup instead of redoing the entire calculation.

What I like the most in this post’s code is that we’re going to delve deeply into the realm of abstraction to then emerge with a concept with clear and practical applications!

```
{-# LANGUAGE DeriveFunctor, TypeFamilies #-}
{-# LANGUAGE ScopedTypeVariables, RankNTypes #-}
{-# LANGUAGE TypeApplications, AllowAmbiguousTypes #-}
import Numeric.Natural
import Data.Kind (Type)
```

```
digraph "Sequences are functions" {
fontname = "monospace";
rankdir = TB;
newrank = true;
ranksep = 0.7;
nodesep = 0.9;
size = "8,5";
concentrate = true;
node [shape = circle
style = "solid,filled"
color = black
fixedsize = shape
fillcolor = invis];
f [shape = square
fillcolor = red];
subgraph cluster_s {
rank = same;
style = invis;
ldots [fontsize=20 color = "#00000000" fillcolor= "#00000000" label = ". . ."];
subgraph cluster_ldots {
ldots;
}
0 [label = "f 0"];
1 [label = "f 1"];
2 [label = "f 2"];
3 [label = "f 3"];
0 -> 1 -> 2 -> 3 -> ldots;
}
edge [dir=both,arrowhead=odot, arrowtail=odot, color="#00000055"];
f -> {0,1,2,3, ldots};
}
```

An important fact that is normally alluded to briefly in any mathematics
book and immediately forgotten by (almost) every reader is that whenever
you see a subindex such as *x*_{n}, what it in
fact denotes is a function application *x*(*n*).^{1} Now
consider the datatype of infinite streams as in the previous post:

```
data Stream a = a :> Stream a
  deriving Functor
infixr 5 :>
```

This type models sequences of type `a`, the kind of thing
a mathematician would denote as {*x*_{n}}_{n ∈ ℕ}.
Oh look, there’s a subindex there! Our previous discussion tells us that
we should be able to interpret a `Stream a` as a function
`Natural -> a`. This is done by indexing: we turn a stream
`xs` into the function that takes `n : Natural` to
the nth element of `xs`. The definition is recursive, as one
might expect:

```
-- Access the nth value stored in a Stream
streamIndex :: Stream a -> (Natural -> a)
streamIndex (x :> _)  0 = x
streamIndex (_ :> xs) n = streamIndex xs (n-1)
```

Conversely, given any `f : Natural -> a`, we can form a
stream by applying *f* to each natural number, forming something like
`[f 0, f 1, f 2, f 3, ..., f n, ...]`. Since
`Stream` is a functor, we achieve this by mapping *f* over the stream of natural
numbers:

```
-- Take f to [f 0, f 1, f 2, f 3, ...]
streamTabulate :: (Natural -> a) -> Stream a
streamTabulate f = fmap f naturals where
  naturals = 0 :> fmap (+1) naturals
```

These functions are inverse to one another:

```
streamIndex . streamTabulate = id
streamTabulate . streamIndex = id
```

Meaning that we have a natural isomorphism
`Stream a ≃ Natural -> a`. This is a strong assertion:
mathematically speaking, Streams and functions from the
Naturals are essentially the same thing. We are doing Haskell here,
however, not pure math. And in a programming language meant to run on a
real computer, not only in the realm of ideas, we must also take into
account something more: how are those types laid out in memory?

In the case of functions, they are compiled to chunks of instructions
that calculate some value. In particular, if you have some rather expensive
function `f : Natural -> a` and have to calculate
`f n` in many parts of your program for the same
`n`, all the work has to be redone each time to get that
sweet value of type `a`.^{2} Streams, on the other
hand, are lazy infinite lists and, because of Haskell’s call-by-need
evaluation strategy, their components remain saved in memory for
reuse.^{3}

This last paragraph is the heart of memoization in Haskell: one does not memoize functions, one turns functions into data and the language automatically memoizes the data. I don’t know about you, but I find this very cool indeed.

A large inspiration to this post comes from the great introduction to memoization in the Haskell wiki. Thus, we will follow their lead and explore the Fibonacci sequence as a recurring example throughout this post.

I must admit that I find illustrating a recursion concept through the Fibonacci numbers kind of a cliché… Nevertheless, clichés have their upside in that you, the reader, will have seen them so much that may even perhaps feel familiar with what we will be doing here. The Fibonacci numbers are also a well-known example where memoization can make a function go from exponential to polynomial complexity. Well, let’s start with their usual recursive definition:

```
fibRec :: Num a => Natural -> a
fibRec 0 = 0
fibRec 1 = 1
fibRec n = fibRec (n-1) + fibRec (n-2)
```

Although elegant, this definition is *extremely slow*. Running
`fibRec 100` on ghci already took much longer than I was
willing to wait… The problem is that the recursion calculates the
same arguments over and over, leading to exponential complexity.
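The same contrast can be reproduced in an eager language. Here is a Python sketch (not part of the original literate Haskell) of the naive recursion next to a cached version, where an explicit cache plays the role that laziness plays below:

```python
from functools import lru_cache

def fib_naive(n):
    # Exponential: recomputes the same subproblems over and over.
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Linear: each value is computed once and then looked up.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
```

`fib_memo(100)` returns instantly, while `fib_naive(100)` is hopeless.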

Since the problem is overlapping calculations, we can accelerate this
function using memoization. But in this case, just turning it into a
stream is not enough, because `fibRec` will still use the
slow definition to build each of the sequence’s components. But fear
nothing, there is salvation! It starts by writing the function in
operator form, instead of using recursion, just like we did with the
Bellman Equation in my previous
post about dynamic programming.

```
fibOp :: Num a => (Natural -> a) -> (Natural -> a)
fibOp v 0 = 0
fibOp v 1 = 1
fibOp v n = v (n-1) + v (n-2)
```

You can think of `fibOp` as one step of the Fibonacci
recursion, where `v` is a function that knows how to continue
the process. Another way to look at it, closer to the dynamic
programming view, is that if `v` is an estimate of the
Fibonacci values, then `fibOp v` is an improved
estimate given the available information. No matter which view you
choose, the important part for us is that the fixed point of
`fibOp` is `fibRec`.

```
fix :: (a -> a) -> a
fix f = let x = f x in x

fibNaive :: Num a => Natural -> a
fibNaive = fix fibOp -- same as fibRec
```

We called it `fibNaive` because it would be rather
naive to do all this refactoring only to arrive at the exact same
thing…

Alright, with `fix` we have all the necessary building
blocks to accelerate our calculations. It’s now time to fit them
together! Before fixing the operator, we will turn it into something
that “keeps a memory”. If we compose our tabulation function with
`fibOp`, we get a function that turns a function `v`
into a Stream, over which we can index to get back a function. In this
case, however, the same stream is shared for all arguments. Thus, the
fixed point indexes into this Stream during the recursion process!
Moreover, there is nothing specific to the Fibonacci sequence in this
process, so we can abstract this procedure into a separate function.

```
streamMemoize :: ((Natural -> a) -> Natural -> a) -> Natural -> a
streamMemoize f = fix (streamIndex . streamTabulate . f)

fibSmart :: Num a => Natural -> a
fibSmart = streamMemoize fibOp
```

Notice that, by our previous discussion,
`streamIndex . streamTabulate` equals `id`. Thus,
by construction, `fibNaive` and `fibSmart` are also
equal as functions. Nevertheless, their runtime behavior is considerably
different! As Orwell would put it: in terms of execution, some equals
are more equal than others.

Very well, what is a function `k -> a` after all? The
textbook definition says it is a rule that to each element of type
`k` associates a unique element of type `a`. The
previous examples have shown us that there are data structures which, in
the sense above, behave a lot like functions. For example, we saw how to
convert between Streams and functions and even used it to accelerate the
calculation of recursive functions. In the case of Streams, both
`streamIndex` and `streamTabulate` are natural
transformations^{4}, meaning that there is a natural
isomorphism between streams and functions with `Natural` as
domain:

`forall a. Stream a ≃ Natural -> a.`

We call a functor that is isomorphic to a type of functions a
**Representable Functor**. Those have important
applications in Category Theory because they are closely related to
universal properties and elements. However, today we are interested in
their more mundane applications, such as memoizing domains other than
the Naturals.

In Haskell, we can codify being Representable as a typeclass. It must have an associated type saying to which function type the Functor is isomorphic, together with two natural transformations that witness the isomorphism.

```
class Functor f => Representable f where
  type Key f :: Type
  tabulate :: (Key f -> a) -> f a
  index    :: f a -> (Key f -> a)
  -- Class laws:
  -- index . tabulate = id
  -- tabulate . index = id
```

As you can imagine, there is a Representable instance for Streams using what we have defined in the previous sections.

```
instance Representable Stream where
  type Key Stream = Natural
  index    = streamIndex
  tabulate = streamTabulate
```

Another interesting instance is for functions themselves! After all, the identity is, strictly speaking, a natural isomorphism.

```
instance Representable ((->) k) where
  type Key ((->) k) = k
  index    = id
  tabulate = id
```

With some type-level magic, we can write a generalized memoization procedure. It has a scarier type signature, since we’re striving for genericity, but the idea remains the same: precompose with tabulate and index before taking the fixed point. The function is essentially the same one we wrote before for Streams, but parameterized over our Representable Functor of choice.

```
-- | Memoize a recursive procedure using a Representable container of your choice.
memoize :: forall f a. Representable f => ((Key f -> a) -> (Key f -> a)) -> (Key f -> a)
memoize g = fix (index @f . tabulate . g)
```

We can recover our Stream-memoized Fibonacci by telling
`memoize`

that we choose `Stream`

as the
container:

```
fibSmart' :: Num a => Natural -> a
= memoize @Stream fibOp fibSmart'
```

The function above is the same as our “more manual”
`fibSmart`

from before. As a matter of fact, even the naive
recursion is case of these memoization schemes! By using the
Representable instance of functions, the methods do nothing, and we get
a memoization scheme that has no storage. Well, this is equivalent to
our naive approach from before.

```
fibNaive' :: Num a => Natural -> a
fibNaive' = memoize @((->) Natural) fibOp
```

With our Representable machinery all set, it would be a shame to end
this post with just one example of memoization.^{5} So,
let’s see how we can memoize the Fibonacci function using an infinite
binary tree structure. This is a fun example to look at because the
isomorphism is not as straightforward as with Streams and because it is
*much faster*. We begin by defining our datatype.

```
data Tree a = Node a (Tree a) (Tree a)
deriving Functor
```

By now, you should already know how the memoization works at a high-level, so we can just define it as before.

```
fibTree :: Num a => Natural -> a
fibTree = memoize @Tree fibOp
```

Alright, how are these `Tree`s representable? In the case
of `Stream`s, the relation was clear: we kept decreasing the
index and advancing on the Stream until the index was zero. And this
amounted to recursing on a unary representation of the Naturals: we
advanced at a successor and stopped at zero. The secret to translating
this idea to `Tree`s is to look at a natural number as
written in binary. I personally find this relation easier to explain
with a drawing. So, while we index Streams with a linear approach,

```
digraph "Stream indexes" {
rankdir=LR;
size="8,5"
node [shape = circle
style = "solid,filled"
color = black
fixedsize = shape
fillcolor = invis];
subgraph cluster_ldots {
rank = same;
style = invis;
ldots [fontsize=20 color = "#00000000" fillcolor= "#00000000" label = ". . ."];
}
0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> ldots;
}
```

For Trees, we index using a breadth-first approach.

```
digraph "Tree indexes" {
rankdir=TD;
size="8,5"
node [shape = circle
style = "solid,filled"
color = black
fixedsize = shape
fillcolor = invis];
subgraph cluster_ldots {
rank = same;
style = invis;
node [fontsize=20 color = "#00000000" fillcolor= "#00000000" label = ". . ."];
{7 8 9 10 11 12 13 14};
}
0 -> {1, 2};
1 -> {3, 4};
2 -> {5, 6};
3 -> {7, 8};
4 -> {9, 10};
5 -> {11, 12};
6 -> {13, 14};
}
```

By the way, I don’t know if the figure above makes it clear since we
are starting at zero, but this arrangement is equivalent to branching
based on the number’s evenness: we descend left on odd numbers
and right on even numbers. The crucial part of this representation is
that we are able to reach the n-th index in `O(log n)` steps,
instead of the `n` steps required for the Stream. Alright,
it’s time to turn all this talking into a proper instance!

Let’s begin with some helpers to make the evenness check more explicit.

```
data Evenness = Even | Odd

evenness :: Integral a => a -> Evenness
evenness n = if odd n then Odd else Even

instance Representable Tree where
  type Key Tree = Natural
```

We tabulate a function using the same ideas as for Streams: create a Tree of natural numbers and map the function over it. The tree is created by branching into even and odd numbers.

```
  tabulate f = fmap f nats where
    nats = Node 0 (fmap (\n -> 2*n + 1) nats)
                  (fmap (\n -> 2*n + 2) nats)
```

For indexing, we test for the number’s evenness and branch accordingly until we’re looking for the zeroth index.

```
  index (Node a _ _) 0 = a
  index (Node _ l r) n = case evenness n of
    Odd  -> index l (div n 2)
    Even -> index r (div n 2 - 1)
```
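As a sanity check, the tabulate/index pair above can be transcribed into Python (a sketch, with the infinite tree truncated to a finite depth) and verified to be mutually inverse on the first few indices:

```python
# build(f, depth) constructs `depth` levels of `fmap f nats`, where
# nats = Node 0 (fmap (\n -> 2n+1) nats) (fmap (\n -> 2n+2) nats).
# Nodes are (value, left, right) tuples; missing levels are None.
def build(f, depth):
    if depth == 0:
        return None
    return (f(0),
            build(lambda n: f(2 * n + 1), depth - 1),
            build(lambda n: f(2 * n + 2), depth - 1))

def index(tree, n):
    value, left, right = tree
    if n == 0:
        return value
    if n % 2 == 1:                       # odd: descend left
        return index(left, n // 2)
    return index(right, n // 2 - 1)      # even: descend right

# Eight levels hold the indices 0..254, each reachable in O(log n) steps.
tree = build(lambda n: n, 8)
assert all(index(tree, n) == n for n in range(255))
```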

With this we finish our stroll through memoization-land. By the way, this post is a literate Haskell file, so I really recommend taking the functions in it and running some benchmarks to see how much the memoization helps. From my own tests in ghci, the Tree Fibonacci is much faster than the other two. But compiling with optimizations, the Stream approach gets much faster, so I might need a better benchmark there.

- Much of the Fibonacci example is adapted from the Haskell wiki page on memoization.
- Chapter 14 of Bartosz Milewski’s great book *Category Theory for Programmers*.
- The adjunctions package on Hackage.

I must comment, perhaps to the abhorrence of my Haskell readers, that I’m rather fond of Fortran. One thing I like about it is how the language uses the same syntax to denote vector indexing and function application: *x*(*n*).↩︎

In fact, not all work must necessarily be redone. If we are calculating `f n` twice in the same context, the compiler may be smart enough to do some sharing and only calculate it once. Of course, this works in Haskell thanks to the magic of referential transparency.↩︎

Well, at least until the next gc pass if it is not in the top-level.↩︎

In Haskell, any polymorphic function `h :: forall a. F a -> G a`, where `F` and `G` are functors, is a natural transformation.↩︎

Well, two if you count the naive approach.↩︎

In this post, we will take a look at the three simplest varieties of recursion schemes: catamorphisms, anamorphisms, and hylomorphisms, all of them deeply linked with structural induction. As we will see, they respectively encapsulate the notions of folding a data structure, constructing a data structure, and using a data structure as an intermediate step.

The first time I heard about recursion schemes was after stumbling upon the paper Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire by Erik Meijer, Maarten Fokkinga, and Ross Paterson. It is a real gem of functional wisdom, but the authors use a notation called Squiggol, which I had a really hard time trying to grasp. Fortunately, I later found Patrick Thomson’s excellent series of posts, which explain recursion schemes using Haskell code. After absorbing this content, I tried to explain the idea to some friends who don’t know Haskell but know some category theory. This post is an enriched version of that presentation, with more cohesion and much less of my bad drawings.

In the spirit of the original presentation, I’ve decided to make this post completely language-agnostic. I also tried to keep any type theory or lambda calculus to a minimum in order to make it easier to grasp. My hope is that anyone with some experience in programming and mathematics will be able to read this, instead of confining this post only to people that know a certain programming language. Of course, in doing that I’m risking that no one besides me will actually understand this post. But, very well, I will do my best.

We begin our journey in the prehistory of programming, that time when computers were still mostly made of coconuts and mammoth fur. At the time, the way to program was quite unstructured. The code of a program was essentially given by a sequence of commands, whereas the data was simply piled up in a stack.

```
\def\colors{red!30!blue!50, red!50, green!30, blue!50, blue!50, red!50, green!30, red!30!blue!50, blue!20}
\begin{scope}[start chain=ctrl going right,
node distance=0.3mm]
{ [minimum size=0.5cm,
tmcell/.style={fill,draw=black, rounded corners=1.618},
every join/.style={-{Latex[length=1mm]}, shorten <= 0.5mm,shorten >= 0.5mm, in=-110, out=-80, looseness=2}]\foreach \c in \colors
\node [tmcell,fill=\c, on chain, join] {\phantom{\tt goto}};
}\node [on chain] {~~~};
{ [orange, node distance = 0.2mm, minimum size=0.4cm]\node [fill, on chain] {};
{ [start branch=stckup going above]\foreach \i in {1,...,2}
\node [draw,on chain] {};
}
{ [start branch=stckdown going below]\foreach \i in {1,...,3}
\node [fill, on chain] {};
\node[on chain, black] (stacklabel) {Stack};
}
}\node[yshift=-0.9cm] at (ctrl-1.south east) {Control flow};
\end{scope}
```

But the program does not have to execute the instructions in order.
Oh no, it would be rather boring and inexpressive… In this programming
paradigm, there is a special command called `goto` which allows one to
jump to any labeled part of the program. In fact, this is the only
command responsible for the program’s control flow.

```
{ [start chain=ctrl going right,
node distance=0.3mm,
minimum size=0.5cm,
tmcell/.style={fill,draw=black, rounded corners=1.618},
arrstl/.style={-{Latex[length=1mm]}, shorten <= 0.5mm,shorten >= 0.5mm}]
\node[tmcell, fill=red!30!blue!50, on chain] {\phantom{\tt goto}};
\node[tmcell, fill=red!50 , on chain] {\phantom{\tt goto}};
\node[tmcell, fill=green!30 , on chain] {\tt goto};
\node[tmcell, fill=blue!50 , on chain] {\phantom{\tt goto}};
\node[tmcell, fill=blue!50 , on chain] {\phantom{\tt goto}};
\node[tmcell, fill=red!50 , on chain] {\phantom{\tt goto}};
\node[tmcell, fill=green!30 , on chain] {\tt goto};
\node[tmcell, fill=red!30!blue!50, on chain] {\phantom{\tt goto}};
\node[tmcell, fill=blue!20 , on chain] {\phantom{\tt goto}};
\draw[arrstl] (ctrl-1) to [in=-110, out=-80, looseness=2] (ctrl-2);
\draw[arrstl] (ctrl-2) to [in=-110, out=-80, looseness=2] (ctrl-3);
\draw[arrstl] (ctrl-3) to [in=80, out=110] (ctrl-6);
\draw[arrstl] (ctrl-6) to [in=-110, out=-80, looseness=2] (ctrl-7);
\draw[arrstl] (ctrl-7) to [in=110, out=80] (ctrl-2);
}
```

The `goto` is a pretty powerful construction, but it also
has its caveats. So much so that its use was famously considered
harmful. Although any kind of program flow can be constructed using
only `goto`, it may become too confusing. The `goto`
statement makes it too easy to write spaghetti code, which is
practically impossible to debug.

After prehistory, we arrive at the societal period of imperative
programming. Here, the programming languages become
*structured*. Data is no longer viewed as simply a stack of memory
but is classified into types. There are some primitive types as well as
structures to combine them into more complex types. In a language like
C, for example, there are `struct` types, `union` types, and array types.

The changes also reached the control flow, and an effort was made
to tame the once wild `goto`. The programmers of that time
analysed its most common use cases and created new statements fulfilling
each of them. You are probably acquainted with them as `for`
loops, `while` loops, `switch/case` statements,
function and subroutine declarations, and `if/else`
statements; just to name a few.

Both in regard to data and control flow, the movement we encounter
here consists of replacing general-purpose structures that
concretely represent the instructions we give to the processor with
specific structures having more abstract roles in the code. In terms of
computational expressiveness, nothing changes. What the processor sees
is the same in both unstructured and structured programming. The benefit
lies in the programmer’s expressiveness. With more specific statements,
it becomes easier to write larger, more complex programs as well as to
properly debug them. As an example, we may notice that it is possible to
guarantee that a `for` loop always terminates if it is
iterating over a block of code that is known to terminate. No strange
thing can happen. On the other hand, the general character of the
`goto` allows us to simulate the behavior of a
`for`, but there is no guarantee that a program with a
`goto` statement must terminate.

Until now, we were looking at a programming paradigm called
*imperative programming*, but the world of programming languages
is not so uniform. Other paradigms exist. And the one that mostly fits
what we will see today is called *functional programming*.

While the imperative paradigm views the code as a sequence of
commands the programmer gives the computer to execute (hence the name,
*imperative*), the functional paradigm views the code as a
composition of functions in the mathematical sense. A function takes a
value as input, does some processing to it, calling other functions for
this, and returns another value as output.

```
\begin{scope}[start chain=funcs going right,
every join/.style={-{latex}, thick},
minimum size=0.8cm]\node[on chain] {input};
\node[fill=blue!20!green!20, draw=black, circle, on chain, join] {$f$};
\node[fill=red!20, draw=black, circle,, on chain, join] {$g$};
\node[on chain,join] {$\cdots$};
\node[fill=blue!70!red!30, draw=black, circle, on chain, join] {$h$};
\node[on chain, join] {output};
\end{scope}
```

If every program only consisted of applying a finite number of previously defined functions, the language’s expressiveness would be rather limited. To overcome this, we need some form of control flow, which is achieved via recursion.

A *recursive function* is a function that, in order to process
its input into the output, may call itself in an intermediate step.
Probably the most famous recursive function is the factorial, defined as
$$ \operatorname{\mathrm{fat}}(n) =
\begin{cases}
1,& n = 0 \\
n \cdot \operatorname{\mathrm{fat}}(n-1),& \text{otherwise}.
\end{cases}
$$
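Although this post is language-agnostic, a quick transcription into a concrete language may help ground the notation. Here is a minimal Python sketch, with `fat` named after the formula above:

```python
# Factorial, mirroring the recursive mathematical definition:
# the base case n == 0 yields 1; otherwise multiply n by fat(n - 1).
def fat(n: int) -> int:
    if n == 0:
        return 1
    return n * fat(n - 1)
```

For instance, `fat(5)` unfolds to `5 * 4 * 3 * 2 * 1 * 1 = 120`.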

The expressiveness gained from recursion is essentially the same as
the one gained from `goto` in imperative languages. That is,
the control flow given by function composition together with recursion
allows one to do anything imaginable with the programming language.
However, all this expressiveness comes with its caveats. It is too easy
to write functional spaghetti code if recursion is used
indiscriminately. Because of all its power, recursive code lacks
safety. It would even be fair to say that, like `goto`, it is
too unstructured.

The idea of structured control flow really caught on in the
imperative world. One hardly sees a wild `goto` in the middle
of a modern piece of code. In fact, many languages don’t even allow it.
In the functional world, on the other hand, the trend of taming
recursion never really caught on. Despite many functional languages
organizing their data using types, control is still done using crude
recursion.

Recursion schemes are ways to organize and structure different kinds
of recursive functions. Instead of writing a recursive function in terms
of itself, we define higher-order functions that receive an ordinary
function as argument, do some recursive magic on it, and return a
recursive analogue of that function as output. It is similar to how a
`while` loop takes a boolean statement and a block of code,
and turns them into a repeating block of code. If it seems too
confusing, just keep on. What I mean here will become clearer after
we construct catamorphisms.

Before we end this motivation and proceed to the actual construction
of recursion schemes, there is an intuition that I believe useful to
have in mind. In imperative programming, there is a close relationship
between how we structure data and how we structure the control flow.
Working with arrays almost asks the programmer to design programs with
`for` loops. Similarly, it is natural to deal with
`union` types using `switch` statements (also
called `case` or `cond` in some languages).
Recursion schemes will arise as an analogue of this idea in the
functional programming setting. So far so good for motivation; let’s
dive into some math.

The simplest way to write a type is by enumerating its elements, such
as `Bool` ≔ true ∣ false.
This way, we define the type of boolean values. That is, a type with
exactly two terms called true and
false. In general, any finite type can
be written just by enumerating its elements. As another example, there
is the type of musical notes
`Notes` ≔ do ∣ re ∣ mi ∣ fa ∣ sol ∣ la ∣ si.
Nevertheless, in practice we also want to deal with types that are more
complex than finite collections. The solution to this is assuming that
the programming language comes with some built-in types such as
integers, characters, and floating-point numbers, together with
structures that allow us to compose these types.

One common compound type consists of a structure capable of storing
data of more than one type at the same time. As an example, let’s say we
are in a space war against many alien species. Your job is to
catalogue how many battleships each army has. One way to store this data
is with a type containing a string for the species to which the army
belongs together with a natural number for how many battleships they
have. We will write this type as
`Army` ≔ ships `String` × ℕ.
Here, ships : `String` × ℕ → `Army`
is called a *constructor* for the type `Army`.
Constructors are (possibly multivariate) functions that receive terms as
arguments and return a term of a compound type.

After you finish your catalogue, orders arrive for you to develop a
new laser cannon to be used in the war. This weapon should scan the sky
to find the enemy armies’ positions (modeled in a type `Position`)
and shoot them. But beware! There are also allied bases around,
encapsulated in the type `Base`.
Since friendly fire is really far from the ideal, our target type should
have two different constructors, one for allies and another for enemies:
$$\begin{aligned}
\mathtt{Target} \coloneqq &\;\operatorname{\mathrm{ally}}\;
\mathtt{Position} \times \mathtt{Base} \\
\mid &\;\operatorname{\mathrm{enemy}}\;
\mathtt{Position} \times \mathtt{Army}.
\end{aligned}$$ In here we extended our notation for enumerating
types to also accept constructors. Different constructors always produce
different terms of the compound type; thus, we may view the functions
ally : `Position` × `Base` → `Target` and
enemy : `Position` × `Army` → `Target`
as tags representing from which type our `Target`
was constructed. This works as if we were storing an element of type
`Army` together with a tag enemy on the type `Target`.
So the definition of `Target` is saying that its terms are of the form
ally (*p*,*x*) or enemy (*p*,*x*), just like we
previously enumerated the terms of finite types.

This point of view allows us to define functions on compound types by
enumerating what they do on the different constructors, a method called
*pattern matching*. For example, the function not : `Bool` → `Bool`
is defined as $$ \begin{aligned}
&\operatorname{\mathrm{not}} \mathop{\mathrm{true}}&=&\;
\mathop{\mathrm{false}}& \\
&\operatorname{\mathrm{not}}
\mathop{\mathrm{false}}&=&\;\mathop{\mathrm{true}}.&
\end{aligned}$$ While our ally-aware cannon shooting function may
be defined as something like $$
\begin{aligned}
&\operatorname{\mathrm{shoot}}(\operatorname{\mathrm{enemy}}(p,
y)) &=&\; \mathtt{laser\_blast}(p) \\
&\operatorname{\mathrm{shoot}}(\operatorname{\mathrm{ally}}(p,
x)) &=&\; \mathtt{wave\_your\_hand}(p).
\end{aligned}$$ Taking advantage of the type system via pattern
matching is a nice way to make it obvious that our code does what it
should. In this case, the types don’t allow us to obliterate our
friends.

Although these compound types are nice to organize our data, the true
power of types comes from taking advantage of two other constructions:
*function types* and *fixed point types*. The function
type between two types *A* and
*B*, represents all the
functions receiving an input of type *A* and returning an argument of type
*B*. To mimic the usual notation
for type signatures, this function type is denoted as *A* → *B*. A type system with
function types allows us to define higher-order functions. For example,
given a function *ϕ*: *A* → *B*, the
composition operator *K*_{ϕ} defined as
*K*_{ϕ}*f* = *f* ∘ *ϕ*
has type signature *K*_{ϕ}: (*B*→*C*) → (*A*→*C*).
As you can see, it takes a function and turns it into another one. This
was a simple example but be sure that many more will soon come, since
higher-order functions are at the heart of recursion schemes.

Now we arrive at the last kind of compound type we are going to
discuss today: fixed point types. These are the star of today’s show, so
pay attention. We may define a compound data type that is parameterized
by some variable (a function between types, if you prefer) such as 𝚖𝚊𝚢𝚋𝚎 *A* ≔ nothing ∣ just *A*.
Given a type *A*, a term of type
𝚖𝚊𝚢𝚋𝚎 *A* is either nothing or the constructor just applied to a term of type *A*. Thus, 𝚖𝚊𝚢𝚋𝚎 receives a type and augments it with a
special point^{2}.

Given a type operator *F*, a
fixed point of *F* is a type
*X* satisfying *X* ≃ *F**X*. In here,
we use the isomorphism sign ≃ instead
of equality because there is some boilerplate involved regarding tags.
Although this is a simple construction, the concept is extremely
powerful, being directly connected to recursion.

As an example let’s explore the fixed point of 𝚖𝚊𝚢𝚋𝚎 . It is a type *N* satisfying *N* ≃ 𝚖𝚊𝚢𝚋𝚎 *N* = nothing ∣ just *N*.
This means that any term of *N*
is either nothing or just a term of *N*. Since we know nothing , we can construct a new term just (nothing), then just (just(nothing)), and proceed
successively in this way. Thus for any natural number *n*, applying just to nothing *n* times defines a unique term of
*N*. Moreover, the definition of
*N* says that all of its terms
are of this form, meaning that *N* is isomorphic to the natural
numbers^{3}.

If you find this last result strange, remember that the natural numbers are inductively defined as being either zero or the successor of another natural number. Using our notation for types, this is equivalent to saying ℕ = zero ∣ succ ℕ. If we alter the tag names, this tells us that ℕ is a fixed point of 𝚖𝚊𝚢𝚋𝚎 as expected.
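This fixed point can be modeled concretely. Below is a minimal Python sketch where nothing is represented by `None` and just by a one-field wrapper, so a natural number is literally a tower of justs:

```python
from dataclasses import dataclass
from typing import Optional

# A term of the fixed point of maybe is either "nothing" (None here)
# or "just" another such term.
@dataclass
class Just:
    value: "Optional[Just]"

Nat = Optional[Just]  # the fixed point: Nat ≃ maybe Nat

zero: Nat = None

def suc(n: Nat) -> Nat:
    return Just(n)

# Peeling off the layers of "just" recovers an ordinary integer.
def to_int(n: Nat) -> int:
    count = 0
    while n is not None:
        n = n.value
        count += 1
    return count
```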

The natural numbers are surely the most famous example of an
inductive type. Nevertheless, almost no one thinks of them like that
while programming. Treating 3 as succ (succ(succ(zero))) would be cumbersome,
to say the least. Any language comes with a (possibly signed) integer
type already bundled with the usual arithmetic operations defined for
it. Thus, let’s discuss a little bit another inductive type that
is also pretty famous but has more of an inductive data structure flavour
to it: the *linked list*.

Let *A* be your favorite
type. Intuitively, a list over *A* is a finite-length sequence of
elements of *A*. The simplest
case possible is an empty list. To represent a non-empty list, we notice
that any non-empty list with *n*
elements may be “factorized” into its first element and another list
with the remaining *n* − 1
elements.

```
\begin{scope}[list/.style={rectangle split, rectangle split parts=2,
draw, rectangle split horizontal}, >=stealth, start chain]\node[list,on chain] (A) {$10$};
\node[list,on chain] (B) {$150$};
\node[list,on chain] (C) {$87$};
\node[on chain,draw,inner sep=6pt] (D) {};
\draw (D.north east) -- (D.south west);
% \draw (D.north west) -- (D.south east);
\draw[Circle->] let \p1 = (A.two), \p2 = (A.center) in (\x1,\y2) -- (B);
\draw[Circle->] let \p1 = (B.two), \p2 = (B.center) in (\x1,\y2) -- (C);
\draw[Circle->] let \p1 = (C.two), \p2 = (C.center) in (\x1,\y2) -- (D);
\end{scope}
```

Thus, by defining a type operator *P* as *P**X* ≔ nil ∣ cons *A* × *X*,
the previous discussion shows that the type of lists over *A*, hereby denoted *L*(*A*), is a fixed point of
*P*, *L*(*A*) ≃ nil ∣ cons *A* × *L*(*A*).
Here, the constructor nil takes the
role of the empty list while the constructor cons represents a pair containing an element
of *A* and another list.
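The same fixed point can be sketched in Python, with `None` playing the role of nil and a two-field node playing the role of cons:

```python
from dataclasses import dataclass
from typing import Any, Optional

# The operator P X = nil | cons A × X: a list is either None (nil)
# or a Cons holding an element and another such list.
@dataclass
class Cons:
    head: Any
    tail: "Optional[Cons]"

def to_builtin(lst: "Optional[Cons]") -> list:
    """Convert the fixed-point representation to a built-in list."""
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out
```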

This inductive definition of lists allows us to recursively define
operations over it. For example, let’s construct a function that sums
all the elements in a list of integers. Its signature is sum : *L*(ℤ) → ℤ. If the list is
empty, the sum returns zero. Otherwise, it takes the first element and
adds it to the sum of what remains, $$
\begin{aligned}
\operatorname{\mathrm{sum}}(\operatorname{\mathrm{nil}}) &= 0, \\
\operatorname{\mathrm{sum}}(\operatorname{\mathrm{cons}}(x, l)) &= x
+ \operatorname{\mathrm{sum}}(l).
\end{aligned}$$ This last definition conceals an extremely useful
pattern. It is the programming analogue of a proof by induction over the
list. The empty list is the base case, which we set to zero. On a list
of length *n* > 0, we assume
that we’ve already solved the computation for the sublist of length
*n* − 1 and them add it to the
remaining element. Just like an inductive proof, see?

As it stands, you would be right to guess that constructing functions
in this way is as pervasive in programming as proofs by induction are in
mathematics. A simple variation of sum
would be a function that multiplies a list of real numbers, prod : *L*(ℝ) → ℝ. To construct it,
all we need to do is take the definition of sum , replace 0 by 1,
replace addition by multiplication, and it is all done! $$ \begin{aligned}
\operatorname{\mathrm{prod}}(\operatorname{\mathrm{nil}}) &= 1, \\
\operatorname{\mathrm{prod}}(\operatorname{\mathrm{cons}}(x, l)) &=
x \cdot \operatorname{\mathrm{prod}}(l).
\end{aligned}$$ As expected, this pattern appears every time we
want to somehow combine the elements of a list into a single value. It
is so common that people have abstracted it into a function called reduce ^{4}. Its type signature is
reduce : *B* × (*A*×*B*→*B*) → (*L*(*A*)→*B*).
Pretty scary, right? Let’s break it down to see how it is just good ol’
induction. When reducing a list of type *L*(*A*) into a value of type
*B*, there are two situations we
may encounter. In the base case the list is empty, and we must specify a
value of type *B* to be
returned. In the other steps, we consider that we already know the
solution for the sublist of length *n* − 1, which yields a value of type
*B*, and then need a rule to
combine this value with the remaining term of type *A* into another term of type *B*. Thus, reduce is an instance of what we call a
*higher-order function*. It is a machine that takes an initial
value and a binary function and outputs another function, now defined on
lists. We call it higher-order because it doesn’t work with ordinary
data, no, it transforms simple functions into recursive ones!
Considering what it does, its definition is actually rather simple. The
result *h* = reduce (*v*,*g*)
is the function that does $$ \begin{aligned}
h(\operatorname{\mathrm{nil}}) &= v, \\
h(\operatorname{\mathrm{cons}}(x,l)) &= g(x,h(l)).
\end{aligned}$$ Take a moment to absorb this definition, there
really is a lot encapsulated on this. First substitute *v* by 0 and *g* by + to see that it becomes sum . Then, substitute *v* by 1 and *g* by ⋅ to see that it becomes prod . Finally, congratulate yourself because
you just understood your first recursion scheme!
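Here is a direct Python sketch of this definition (using built-in lists for convenience; `reduce(v, g)` returns the function *h* described above):

```python
# reduce(v, g) collapses a list: the empty list becomes v, and a
# cons(x, rest) becomes g(x, r) where r is the already-collapsed tail.
def reduce(v, g):
    def collapse(lst):
        if not lst:
            return v
        x, *rest = lst
        return g(x, collapse(rest))
    return collapse

# The two instances from the text.
sum_ = reduce(0, lambda x, acc: x + acc)
prod = reduce(1, lambda x, acc: x * acc)
```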

If the way reduce was introduced
made you think that it is only used to collapse a list into a primitive
type, you should know that it is much more ubiquitous than that. Two of
the most used list-manipulating functions in functional programming are
map and filter . And, guess what, both are
implementable in terms of reduce . The
first, map , takes a function *f*: *A* → *B* and turns
it into another function that applies *f* elementwise to a list. The type
signature of map is therefore map : (*A*→*B*) → (*L*(*A*)→*L*(*B*)).
On the empty list, map (*f*)
does nothing since it has no elements. On a cons node, it should apply *f* to the element stored on it and
then proceed to apply *f* to the
list’s tail. Thus, the definition of map in terms of reduce is $$
\begin{aligned}
g(x, l) &= \operatorname{\mathrm{cons}}(f(x), l), \\
\operatorname{\mathrm{map}}(f) &=
\operatorname{\mathrm{reduce}}(\operatorname{\mathrm{nil}}, g).
\end{aligned} $$ If it is hard to wrap your head around this last
definition, try doing it step by step with a simple function such as
*x*^{2} and a small list
such as (1,2,3,4). I promise you it
will be enlightening.
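Sketched in Python (with the `reduce` from before), the definition of map in terms of reduce is a one-liner:

```python
def reduce(v, g):
    def collapse(lst):
        if not lst:
            return v
        x, *rest = lst
        return g(x, collapse(rest))
    return collapse

# On nil return nil; on cons(x, l), apply f to the head and rebuild
# the list in front of the already-mapped tail.
def map_(f):
    return reduce([], lambda x, l: [f(x)] + l)
```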

The filter function takes a
predicate, represented as *p*: *A* → `Bool`,
and outputs a function that filters a list. That is, it removes all elements
of the list for which *p* is
false. Its type signature is thus
filter : (*A*→`Bool`) → (*L*(*A*)→*L*(*A*)).
Since the empty list has no elements, there is nothing to do in this
case. On a cons (*x*,*l*), we should
keep *x* exactly if *p*(*x*) is true and return
just *l* otherwise. We can
assemble this in terms of reduce as
$$ \begin{aligned}
h(x, l) &= \begin{cases}
\operatorname{\mathrm{cons}}(x,l),& \text{if } p(x) =
\mathop{\mathrm{true}}\\
l,& \text{if } p(x) = \mathop{\mathrm{false}}
\end{cases}\\
\operatorname{\mathrm{filter}}(p) &=
\operatorname{\mathrm{reduce}}(\operatorname{\mathrm{nil}}, h).
\end{aligned}$$
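The Python sketch, again in terms of the earlier `reduce`, makes the two cases of *h* explicit:

```python
def reduce(v, g):
    def collapse(lst):
        if not lst:
            return v
        x, *rest = lst
        return g(x, collapse(rest))
    return collapse

# Keep x in front of the tail when p(x) holds; otherwise drop it.
def filter_(p):
    return reduce([], lambda x, l: [x] + l if p(x) else l)
```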

I hope you liked these examples because the next step in our journey is generalizing reduce to any inductive data type! To achieve this, we must pass through the land of category theory.

While the previous sections dealt more with programming, this one leans more to the mathematical side. But be assured that the fruits of all this abstraction will pay well.

A nice thing about types and functions between them is that they form
a *category*.^{5} In fact, with all the structure we
defined, they have the much more powerful structure of a Cartesian
closed category. But we aren’t going to use this last part today. If
you happen to not know what a category is, I’m afraid I can’t properly
introduce them in here.^{6} But I will go through the basics for
the sake of completeness.

A category 𝒞 may be imagined as a
special kind of graph. There is a collection of vertices, called its
*objects* and between each pair of vertices there is a collection
of *directed edges*, called *arrows* or
*morphisms*. If *A* and
*B* are objects, the notation
for an arrow *r* from *A* to *B* is the same as for functions,
*r*: *A* → *B*.
Soon we will see why. What differentiates a category from a simple
directed graph is that we have a notion of *composition* between
the arrows. That is, if *f*: *A* → *B* and *g*: *B* → *C*, there
always exists another arrow *g* ∘ *f*: *A* → *C*.
You may think of arrows as representing paths and *g* ∘ *f* as the arrow path
going through *f* and then
through *g*.

```
{ [scale=2.5, obj/.style={circle, minimum size=3.5pt, inner sep=0pt, outer sep=0pt, fill, draw=black},
morph/.style={-{Stealth[length=1.5mm]}, thin, shorten >= 0.5mm, shorten <= 0.5mm}]\node[obj, fill=green!30] (A) at (0,0) {};
\node[obj, fill=orange!50] (B) at (1,-1) {};
\node[obj, fill=green!70!black] (C) at (1.5,1) {};
\node[obj, fill=red!30!blue!50] (D) at (2.5,0.6) {};
\node[obj, fill=purple!70] (E) at (2.3,-0.5) {};
\node[obj, fill=yellow!90!black] (F) at (3,0.8) {};
\draw[morph] (A) edge (B);
\draw[morph] (B) edge (E);
\draw[morph] (C) edge [bend right] (B);
\draw[morph] (D) edge [bend left=10] (E);
\draw[morph] (D) edge [bend right=10] (C);
\draw[morph] (E) edge [in=270, out=10, looseness=30] (E);
\draw[morph] (D) edge [bend right=50] (F);
\draw[morph] (D) edge [bend left=40] (F);
\draw[morph, teal] (A) .. controls (1, -1.7) .. (E);
\draw[morph, teal] (D) edge [bend right] (B);
}
```

To call 𝒞 a category, composition
must also satisfy two laws. First, it must be associative. That is, for
any composable triple of arrows, *h* ∘ (*g*∘*f*) = (*h*∘*g*) ∘ *f*.
Second, for each object *X*, a
special arrow id_{X}: *X* → *X*
must exist, called the *identity* for *X*, with the special property that
for any *f*: *A* → *B*, *f* ∘ id_{A} = id_{B} ∘ *f* = *f*.

Now, back to types. Probably the most striking feature of functional
programming is that it is stateless. That is, we don’t think of
functions as a series of procedures mutating some registers; we think of
them as mathematical functions and, as such, *f*(*x*) must return the
exact same value no matter when we execute it. For us, the main
consequence of this is that function composition behaves exactly like
the composition of mathematical functions. Thus, there is a category
`Types` whose objects are types and whose arrows are functions between
types. Composition is the usual composition of functions and the
identities are the identity functions id_{A}(*x*) = *x*.

```
\def\N{\mathbb{N}}
\def\Z{\mathbb{Z}}
\def\C{\mathbb{C}}
\def\R{\mathbb{R}}
\def\op#1{\operatorname{\mathrm{#1}}}
\begin{scope} [scale=2.5,
morph/.style={-{Stealth[length=1.5mm]}, thin, shorten >= 0.5mm, shorten <= 0.5mm}]\node (A) at (0,0) {$L(\Z)$};
\node (B) at (1,-1) {$L(\R)$};
\node (C) at (1.5,1) {$(A \to \mathtt{maybe} B)$};
\node (D) at (2.5,0.6) {$\N$};
\node (E) at (2.3,-0.5) {$\mathbb{R}$};
\node (F) at (3,0.8) {$\mathtt{Bool}$};
\draw[morph] (A) edge node[above, sloped] {\small $\op{map}\,\sqrt{\,\cdot\,}$} (B);
\draw[morph] (B) edge node[above, sloped] {\small $\op{prod}$} (E);
\draw[morph] (C) edge [bend right] node[left, midway] {\small $g$}(B);
\draw[morph] (D) edge [bend left=10] node[right] {\small $\sqrt{\,\cdot\,}$} (E);
\draw[morph] (D) edge [bend right=10] node[above] {\small $f$} (C);
\draw[morph] (E) edge [in=300, out=10, looseness=5] node[right] {\small $\op{id}$} (E);
\draw[morph] (D) edge [bend left=50] node[above, sloped] {\small odd} (F);
\draw[morph] (D) edge [bend right=40] node[below, sloped] {\small $\op{even}$} (F);
\draw[morph] (A) .. controls (1, -1.7) .. node [below] {\small $\op{prod} \circ (\op{map}\, \sqrt{\,\cdot\,})$} (E);
\draw[morph] (D) edge [bend right] node [midway, above, sloped] {\small $g \circ f$} (B);
\end{scope}
```

Category theory is not only concerned with categories but also with
transformations between them. As in most of mathematics, we study the
transformations that preserve the structure of composition and identity,
called *functors*. More precisely, a functor from 𝒞 to 𝒟 is
comprised of a function taking objects of 𝒞 to objects of 𝒟 and a function taking arrows of 𝒞 to arrows of 𝒟 (normally both denoted by *F*) such that composition and
identities are preserved, $$ \begin{aligned}
F(g \circ f) &= F(g) \circ F(f) \\
F(\operatorname{\mathrm{id}}) &= \operatorname{\mathrm{id}}.
\end{aligned}$$ As an example, the type operator 𝚖𝚊𝚢𝚋𝚎 is a functor
from `Types` to itself when coupled with the operation on arrows $$ \begin{aligned}
\mathop{\mathrm{\mathtt{maybe}}}(f)(\operatorname{\mathrm{nothing}})
&= \operatorname{\mathrm{nothing}} \\
\mathop{\mathrm{\mathtt{maybe}}}(f)(\operatorname{\mathrm{just}}(x))
&= \operatorname{\mathrm{just}} (f(x)).
\end{aligned} $$ More generally, if a type operator *G* is defined using only products,
coproducts, other functors, and function types with the variable
appearing only on the right-hand side, there is a canonical way to turn
it into a functor from `Types` to itself. Some compilers, such as
Haskell’s GHC, can even do that automatically for us. Instead of going
through the construction’s gory details, I think an example better
elucidates the method. Say we have a type operator $$\begin{aligned}
G X = &\operatorname{\mathrm{pig}} A \\
\mid &\operatorname{\mathrm{yak}} A \times X \times
(\mathop{\mathrm{\mathtt{maybe}}}X) \\
\mid &\operatorname{\mathrm{cow}} B \times (C \to X).
\end{aligned}$$ To turn it into a functor, we define an action on
functions as $$ \begin{aligned}
(G\,f)(\operatorname{\mathrm{pig}}(a)) &=
\operatorname{\mathrm{pig}}(a) \\
(G\,f)(\operatorname{\mathrm{yak}}(a, x, m)) &=
\operatorname{\mathrm{yak}}(a, f(x),
\mathop{\mathrm{\mathtt{maybe}}}(f)(m)) \\
(G\,f)(\operatorname{\mathrm{cow}}(b, g)) &=
\operatorname{\mathrm{cow}}(b, f \circ g)
\end{aligned} $$ Essentially, what *G* *f* does is to unwrap each
constructor and pass *f* to each
of its arguments according to one of the following rules:

- If the term is of type *X*, apply *f* to it.
- If the term is of type (*A* → *X*), compose *f* with it.
- If the term is of type *F* *X* for some functor *F*, apply *F* *f* to it.
- Otherwise, just return the term itself.

Now we know that both the natural numbers and lists aren’t just the
fixed points of ordinary type operators but of functors from `Types`
to itself. But is there anything special in this? As you may expect, the
answer is *yes*. If a type is the fixed point of a functor, then
we can define recursion schemes for it that work very much like
structural induction. Since this construction works on any category, it
won’t hurt to do it in full generality. We begin by fixing a category
𝒞 and a functor *F* from 𝒞 to itself. From *F*, we will construct an auxiliary
category, called the category of *F-algebras*.

An *F*-algebra is an object
*X* of 𝒞 together with an arrow *f*: *F**X* → *X*.
Morphisms between *F*-algebras
*f*: *F**X* → *X*
and *g*: *F**Y* → *Y*
are given by an arrow *h*: *X* → *Y* such that
the following diagram commutes

```
F X \ar[d, "f"'] \ar[r, "F h"] & F Y \ar[d, "g"] \\
X \ar[r, "h"'] & Y
```

You can check that this definition turns *F*-algebras into a category where the
identity for *f*: *F**X* → *X*
is given by id_{X}
itself. On this category, we are interested in a special kind of object
called an *initial F-algebra*. That is an *F*-algebra
in : *F**I* → *I* such that for any other *F*-algebra
*f*: *F**X* → *X* there exists a unique morphism from in to *f*.

Any initial *F*-algebra in : *F**I* → *I* is an
isomorphism in 𝒞.

Before we prove this theorem, notice that it implies that *I* is a fixed point of *F*! Not every fixed point
defines an initial algebra, but just the ones that are “smallest” in
a certain sense.^{7} Nevertheless, finite-length
inductive data structures are generally initial.

First, notice that if in : *F**I* → *I* is an
*F*-algebra, so is *F*in : *F*(*F**I*) → *F**I*.
Since in is initial, there is a unique
arrow *g*: *I* → *F**I*
making the following diagram commute

```
F I \ar[d, "\operatorname{in}"'] \ar[r, dashed, "F g"] & F (FI) \ar[d, "F \operatorname{in}"] \\
I \ar[r, dashed, "g"'] & F I
```

Since in itself may be viewed as a
morphism between the *F*-algebras *F*in and in , we get that their composition is a
morphism from in to itself represented
by the following diagram, where all paths still commute

```
F I \ar[r, dashed, "F g"] \ar[d, "\operatorname{in}"'] & F(F I) \ar[r, "F \operatorname{in}"] \ar[d, "F \operatorname{in}"] & F I \ar[d, "\operatorname{in}"] \\
I \ar[r, dashed, "g"'] & F I \ar[r, "\operatorname{in}"'] & I
```

Since in is initial, the unique
arrow going from it to itself is the identity. Thus, we must have in ∘ *g* = id . Moreover, from the
definition of functor and the previous diagram, *g* ∘ in = (*F*in) ∘ (*F**g*) = *F*(in∘*g*) = *F*(id) = id .
Therefore, *g* is an inverse to
in , concluding the proof.

We finally have the necessary tools to define our first recursion
scheme in all its glory and generality. Consider, just as in the
previous section, a functor *F* from `Types`
to itself and call its initial algebra in : *F**A* → *A*. Given
an *F*-algebra *f*: *F**X* → *X*,
its *catamorphism*, which we will denote by cata *f*, is the unique arrow from
*A* to *X* given by the initiality of in .

Before proceeding, I must allow myself a little rant: sometimes mathematicians are just terrible name-givers. Like, what in hell is a catamorphism? What kind of intuition should it elicit? Well… as a matter of fact the name makes a lot of sense if you happen to be speaking in ancient Greek. Since, despite the arcane name, the catamorphism is a super cool concept, I think it deserves that we do a little etymological break to explain where its name comes from.

The word catamorphism has two parts, *cata-* +
*morphism*. The latter comes from the Greek ‘μορφή’ and means “form”
or “shape”. Its use is common in category theory and stems from the
fact that usually, the arrows in a category are structure (or shape)
preserving functions. For example, in the category of *F*-algebras, the morphisms are
functions that commute with the algebras, thus “preserving” its
application. The prefix “cata-” comes from the Greek ‘κατά’ and means something like a “downward
motion”. It is the same prefix as in “cataclysm” or “catastrophe”
and this is the intuition we shall keep in mind. If you think of an
inductive data structure as a great tower piercing the sky, the
catamorphism is a divine smite that, in a catastrophic manner, collapses
the tower into a single value.^{8}

Ok, back to mathematics.

As you may have imagined, the catamorphism is a generalization of the
reduce operator. The difference is
that while reduce only works for
lists, cata can collapse the initial
algebra of any functor. But, in all this generality, how do we actually
compute a catamorphism? Let’s start by taking a look at its type
signature: cata : (*F**X*→*X*) → (*A*→*X*).
It is a higher-order operator that takes a function from *F**X* to *X* and turns it into another
function, now from the fixed point *A* to *X*. Intuitively, what it encapsulates
is that if we know how to collapse one step of a data structure, then we
can use structural induction to collapse the entire structure.

To properly calculate cata , let’s take a look at the commutative diagram defining it,

```
F A \ar[d, "\operatorname{in}"] \ar[r, "F(\operatorname{cata} f)"] & F X \ar[d, "f"] \\
A \ar[u, bend left=45, "\operatorname{in}^{-1}"], \ar[r, dashed, "\operatorname{cata} f"'] & X
```

In this diagram, there are two ways to go from *A* to *X*. From its commutativity, we know
that they are equal. Hence, cata must
satisfy cata *f* = *f* ∘ *F*(cata*f*) ∘ in^{−1}.
So, that’s a pretty concise formula. Let’s unpack it with some examples
before trying to give a full explanation.

The first inductive data type we encountered were the natural
numbers, defined as the least fixed point of *F**X* = zero ∣ succ *X*.
For it, the initial algebra and its inverse are defined as $$\begin{aligned}
\operatorname{\mathrm{in}}(\operatorname{\mathrm{zero}}) &= 0, \\
\operatorname{\mathrm{in}}(\operatorname{\mathrm{succ}}n) &= n+1,
\end{aligned}\quad
\begin{aligned}
\operatorname{\mathrm{in}}^{-1}(0) &= \operatorname{\mathrm{zero}},
\\
\operatorname{\mathrm{in}}^{-1}(n) &=
\operatorname{\mathrm{succ}}(n-1).
\end{aligned}$$ If these definitions seem too obvious, remember
that ℕ denotes the natural numbers as
we think of them every day, while zero
and succ are the formal constructors
of the functor *F*. Keeping
track of this boilerplate is essential.

As our first example, let’s construct the function exp : ℕ → ℝ, which takes a natural *n* to the real number *e*^{n}, as a
catamorphism. Exponentiation is recursively defined by the equations
$$\begin{aligned}
e^{0} &= 1 \\
e^{n+1} &= e \cdot e^n.
\end{aligned}$$ In terms of an algebra *f*: *F*ℝ → ℝ, this means that
on the constructor zero we must return
1 while in each succ step, we must return *e* times the already accumulated
value. Then, exp = cata *f*,
where $$\begin{aligned}
f(\operatorname{\mathrm{zero}}) &= 1 \\
f(\operatorname{\mathrm{succ}}x) &= e \cdot x.
\end{aligned}$$ Let’s use the catamorphism formula to unwrap the
calculations of exp . First, in^{−1} writes a
number as the succ of its predecessor.
Then, the way we defined the functor action of a data type means that
*F*(cata*f*) unwraps
succ *n* to apply cata *f* to the natural number *n* stored inside it and then
reapplies the constructor, leaving us with succ ((cata*f*)(*n*)). This
process recursively continues until we encounter a zero . But this time *F*(cata*f*) *does not*
reapply cata *f*; instead, it
returns zero as is. At this point, our
call stack consists of the function *f* and the constructor succ interleaved *n* times, wrapped around a final application of *f* to zero . For example, let’s take a look at the
traceback of calling exp (2): $$\begin{aligned}
\exp(2) &= (f \circ F(\operatorname{\mathrm{cata}}f) \circ
\operatorname{\mathrm{in}}^{-1})(2) \\
&=
f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{in}}^{-1}(2)))
\\
&=
f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{succ}}1)) \\
&=
f(\operatorname{\mathrm{succ}}(\operatorname{\mathrm{cata}}f(1))) \\
&=
f(\operatorname{\mathrm{succ}}(f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{in}}^{-1}(1)))))
\\
&=
f(\operatorname{\mathrm{succ}}(f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{succ}}0))))
\\
&=
f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{succ}}(\operatorname{\mathrm{cata}}f(0))))) \\
&=
f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{succ}}(f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{in}}^{-1}(0)))))))
\\
&=
f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{succ}}(f(F(\operatorname{\mathrm{cata}}f)(\operatorname{\mathrm{zero}}))))))
\\
&=
f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{zero}})))))
\\
&=
f(\operatorname{\mathrm{succ}}(f(\operatorname{\mathrm{succ}}1))) \\
&= f(\operatorname{\mathrm{succ}}(e\cdot 1)) \\
&= e \cdot (e \cdot 1).
\end{aligned}$$ Ok, these were a lot of parentheses. However, if
you actually followed that mess above, the catamorphism’s pattern should
be clear by now.

As a final example of catamorphism, let’s write a little calculator
that supports addition, multiplication and exponentiation by a natural.
A calculator should take an arithmetic expression and return a real
number. We define an expression recursively. It may be a real number,
the sum of two other expressions, the multiplication of two expressions,
or an expression raised to a natural exponent. This is represented by a
type `Expr`
which is the least fixed point of the functor $$\begin{aligned}
F\,X = &\operatorname{\mathrm{const}}\, \mathbb{R}\\
\mid &\operatorname{\mathrm{add}}\, X \times X \\
\mid &\operatorname{\mathrm{mult}}\, X \times X \\
\mid &\operatorname{\mathrm{pow}}\,X \times\mathbb{N}.
\end{aligned}$$

To evaluate an expression, we need an appropriate *F*-algebra *f*: *F*ℝ → ℝ. As with natural
numbers, the idea here is to treat the constructor const , where *X* doesn’t appear, as a base case and to
think of the other constructors as storing an already solved problem in
the *X*. With this in mind, the
evaluator *F*-algebra is $$\begin{aligned}
f(\operatorname{\mathrm{const}}(a)) &= a \\
f(\operatorname{\mathrm{add}}(x,y)) &= x + y \\
f(\operatorname{\mathrm{mult}}(x,y)) &= x \cdot y \\
f(\operatorname{\mathrm{pow}}(x,n)) &= x^n.
\end{aligned}$$ And the evaluator eval : `Expr` → ℝ
is just cata *f*.
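As a sketch of how this evaluator might look in code, here is a hypothetical Python rendition where expressions are tagged tuples and `cata_expr` plays the role of cata (the encoding and all names are mine, not from the post):

```python
def cata_expr(f, expr):
    """Fold an expression bottom-up with the algebra f."""
    tag = expr[0]
    if tag == "const":
        return f(expr)  # base case: no sub-expression to recurse into
    if tag in ("add", "mult"):
        return f((tag, cata_expr(f, expr[1]), cata_expr(f, expr[2])))
    # pow: recurse on the sub-expression, keep the natural exponent as-is
    return f(("pow", cata_expr(f, expr[1]), expr[2]))

def f(layer):
    tag = layer[0]
    if tag == "const":
        return layer[1]
    if tag == "add":
        return layer[1] + layer[2]
    if tag == "mult":
        return layer[1] * layer[2]
    return layer[1] ** layer[2]  # pow

# (1 + 2) * 3^2
expr = ("mult", ("add", ("const", 1.0), ("const", 2.0)),
                ("pow", ("const", 3.0), 2))
print(cata_expr(f, expr))  # 27.0
```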

Another application of a catamorphism arises if, instead of evaluating the
expression, we want to print it as a string. Let’s say we have a method
str that converts numbers to strings
and an operator ⋄ that concatenates two
strings. Erring on the side of too many parentheses, a candidate *F*-algebra is $$\begin{aligned}
g(\operatorname{\mathrm{const}}(a)) &=
\operatorname{\mathrm{str}}(a) \\
g(\operatorname{\mathrm{add}}(x,y)) &= x \diamond
\mathtt{"\mathord{+}"} \diamond y \\
g(\operatorname{\mathrm{mult}}(x,y)) &= \mathtt{"("} \diamond x
\diamond \mathtt{")\mathord*("} \diamond y \diamond \mathtt{")"} \\
g(\operatorname{\mathrm{pow}}(x,n)) &= \mathtt{"("} \diamond x
\diamond \mathtt{")\char`\^"} \diamond \operatorname{\mathrm{str}}(n).
\end{aligned}$$ As you probably already expect, the function
show : `Expr` → `String`
that converts an expression to a string is just cata *g*.
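Under the same hypothetical tuple encoding as before, the pretty-printer algebra could be sketched like this (names are mine, and the generic fold is repeated so the snippet is self-contained):

```python
def cata_expr(g, expr):
    """Fold an expression bottom-up with the algebra g."""
    tag = expr[0]
    if tag == "const":
        return g(expr)
    if tag in ("add", "mult"):
        return g((tag, cata_expr(g, expr[1]), cata_expr(g, expr[2])))
    return g(("pow", cata_expr(g, expr[1]), expr[2]))  # exponent stays a number

def g(layer):
    """Render one layer, erring on the side of too many parentheses."""
    tag = layer[0]
    if tag == "const":
        return str(layer[1])
    if tag == "add":
        return layer[1] + "+" + layer[2]
    if tag == "mult":
        return "(" + layer[1] + ")*(" + layer[2] + ")"
    return "(" + layer[1] + ")^" + str(layer[2])  # pow

expr = ("mult", ("add", ("const", 1), ("const", 2)), ("pow", ("const", 3), 2))
print(cata_expr(g, expr))  # (1+2)*((3)^2)
```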

Least fixed points are recursive data structures. The catamorphism
abstracts the process of transforming these structures into another type
via structural induction. All recursion occurs inside the formula for
cata . If you just use it as a black
box that turns functions *F**X* → *X* into
functions *A* → *X*
(where *A* is the fixed point), you will never actually see the
recursion’s self-reference. The operator cata abstracts recursion just like a
`for` loop abstracts the `goto` away. It is still
there, but following a fixed structure.

Since cata is a more rigid
structure than general recursion, it is easier to reason about. For
example, one of the strongest properties of catamorphisms is that if the
recursive data structure is finite and *f* is a total function, then cata *f* is guaranteed to
eventually stop. As a lot of applications use finite data, this
greatly facilitates debugging a program.

Collapsing a data structure into a value has many use cases but is
not the end of the story. Sometimes we want to do the exact opposite:
take a value as a seed to construct a list or another structure from it.
This notion is dual to the catamorphism and, of course, there is also a
recursion scheme to encapsulate it: the *anamorphism*.^{9}

Again we have a pretty arcane name in our hands. Let’s take a look at
its etymology in order to clarify things. The prefix *ana-* comes
from the Greek ‘ἀνα’ and generally means “upward
motion” but is also used to mean “strengthening” or “spreading all
over”. It is the same prefix of *analogy* and of *anabolic
steroids*. In both these words, the prefix means that we are
building something up.

In order to describe the catamorphism in all its generality and yet with a concise formula, we had to dive into some categorical abstract nonsense. As you might expect, the same applies to the anamorphism. Fortunately, category theory has a tool called duality: if we reverse all the arrows in a theorem, we obtain a dual theorem that also holds.

Given a category 𝒞 and a functor
*F*: 𝒞 → 𝒞, we define an *F*-coalgebra to be an object *X* of 𝒞 together with an arrow *f*: *X* → *F**X*.
The *F*-coalgebras form a
category where the morphisms between *f*: *X* → *F**X*
and *g*: *Y* → *F**Y*
are arrows *h*: *X* → *Y* such that
the diagram below commutes.

```
F X \ar[r, "F h"] & F Y \\
X \ar[u, "f"] \ar[r, "h"] & Y \ar[u, "g"']
```

The dual notion to an initial *F*-algebra is a *terminal*
*F*-coalgebra. Namely, a
coalgebra out : *S* → *F**S* such
that there is a unique morphism from any other *F*-coalgebra to out . As you probably already expect,
terminal coalgebras are always isomorphisms. The proof is essentially
the same as the one for initial *F*-algebras, the only difference
being some reversed arrows.

Any terminal *F*-coalgebra
out : *S* → *F**S*
is an isomorphism.

The most direct consequence of this theorem is that *S* must be a fixed point of *F*. Since there is an arrow from
every other coalgebra to *S*, it
is called the *greatest fixed point* of *F*. For this presentation, there is
no need to discriminate between least and greatest fixed points,
especially because there are languages, such as Haskell, where they are the
same. Just keep in mind that when working with coalgebras, there is no
guarantee that functions eventually stop. We are no longer in the
land of finite data structures but in the much greater land of possibly
infinite data structures.

Given a *F*-coalgebra *f*: *X* → *F**X*,
its anamorphism ana *f* is
defined as the unique arrow from *f* to the terminal coalgebra out

```
F S \ar[d, bend right=45, "\operatorname{out}^{-1}"'] & F X \ar[l, "F(\operatorname{ana} f)"'] \\
S \ar[u, "\operatorname{out}"'] & X \ar[u, "f"'] \ar[l, dashed, "\operatorname{ana} f"]
```

Thus, the type signature of ana
viewed as a higher-order operator is ana : (*X*→*F**X*) → (*X*→*S*).
It turns a coalgebra, which we may view as one construction step, into a
function that constructs an entire inductive data structure. If we pay
attention to the commutative diagram above, we can see that there is a
concise recursive formula defining ana , ana *f* = out^{−1} ∘ *F*(ana*f*) ∘ *f*.
This is the same formula we had for cata , but with the order of composition reversed!
This reversal, however, means that we are building structures instead
of collapsing them.

Let’s do some examples with lists. Recall that the type *L*(*A*) of lists with elements
of type *A* is a fixed point of
the functor *P**X* = nil ∣ cons *A* × *X*.
Using the usual notation [*a*,*b*,*c*,…] to
represent lists, the terminal coalgebra out is defined as $$ \begin{aligned}
\operatorname{\mathrm{out}}([\;]) &= \operatorname{\mathrm{nil}}\\
\operatorname{\mathrm{out}}([a,b,c,\ldots]) &=
\operatorname{\mathrm{cons}}(a, [b,c,\ldots]).
\end{aligned}$$ One of the simplest list anamorphisms is a
function that receives a natural number *n* and returns a decreasing list from
*n* to 1. It is defined as ana *g*, where *g* is the coalgebra $$ g(n) = \begin{cases}
\operatorname{\mathrm{nil}},& n < 1 \\
\operatorname{\mathrm{cons}}(n, n-1),& \text{otherwise}.
\end{cases}
$$

In practice, it is most common to use an increasing list, however.
The induction on this one is a little trickier but nothing an
anamorphism can’t handle. To make it more fun, let’s construct a
function range : ℤ × ℤ → *L*(ℤ)
taking a pair of integers to the closed integer interval between them.
We want range (−2,1) = [−2,−1,0,1] and
range (7,5) = [ ], for example. To
achieve this, we will need a family of coalgebras $$ g_b(n) = \begin{cases}
\operatorname{\mathrm{nil}},& n > b \\
\operatorname{\mathrm{cons}}(n, n+1), & \text{otherwise}.
\end{cases}$$ Then, our range function is the anamorphism range (*a*,*b*) = (ana*g*_{b})(*a*).
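Here is a sketch of how ana and range might look in Python. For the list functor, the anamorphism unrolls into a simple loop; the tagged-tuple encoding and the names are assumptions of mine:

```python
def ana(g, seed):
    """Unfold a list from a seed with the coalgebra g."""
    out, layer = [], g(seed)
    while layer[0] != "nil":
        _, head, nxt = layer
        out.append(head)   # out^-1: attach the element just produced
        layer = g(nxt)     # keep unfolding from the new seed
    return out

def range_(a, b):
    def g(n):              # the family of coalgebras g_b from the text
        return ("nil",) if n > b else ("cons", n, n + 1)
    return ana(g, a)

print(range_(-2, 1))  # [-2, -1, 0, 1]
print(range_(7, 5))   # []
```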

If you ever dealt with plotting libraries, you are probably familiar
with a `linspace` method. It receives two real arguments
*a*, *b*, and a natural argument *n* and returns a list of *n* points uniformly distributed
between *a* and *b*. The construction of such a
function linspace : ℝ × ℝ × ℕ → *L*(ℝ) is a
simple variation on the range we
defined earlier. We start with a family of coalgebras $$ g(a, b, n)(x) = \begin{cases}
\operatorname{\mathrm{nil}},& x > b \\
\operatorname{\mathrm{cons}}(x, x + \frac{b-a}{n}), &
\text{otherwise},
\end{cases}$$ and define it to be the anamorphism linspace (*a*,*b*,*n*) = (ana*g*(*a*,*b*,*n*))(*a*).

To end this section, let’s turn to an example from number theory. We
will construct the sieve of
Eratosthenes.^{10} This algorithm receives a natural
number *n* and returns a list of
all primes below *n*. The idea
is to start with a list of numbers between 2 and *n*, and to recursively refine it
eliminating all multiples of a given prime. Let test (*p*) be the predicate that tests
if a number is not divisible by *p*. You can implement it by testing
whether the remainder of the input by *p* is nonzero, for example. This
refinement process is encapsulated by the function sieve = ana *g* where *g* is the coalgebra $$ \begin{aligned}
\operatorname{\mathrm{g}}([\;]) &= \operatorname{\mathrm{nil}}\\
\operatorname{\mathrm{g}}([p, x,\ldots]) &=
\operatorname{\mathrm{cons}}(p,
(\operatorname{\mathrm{filter}}(\operatorname{\mathrm{test}}(p))([x,\ldots]))).
\end{aligned}
$$ And our prime-listing function is the composition era (*n*) = sieve (range(2,*n*)).
This process works because after each filtering, the list’s first
element cannot be divisible by any number below it. Thus it must be
prime. Notice that this function eventually stops because, as soon as we
reach an empty list, the algorithm returns a nil .
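The finite sieve can be sketched as an unfold over plain Python lists (an assumption-laden translation of mine, not the author's code):

```python
def sieve(xs):
    """Unfold: g([]) = nil; g([p, ...]) = cons(p, filter(test p)(...))."""
    out = []
    while xs:
        p, rest = xs[0], xs[1:]
        out.append(p)                          # the head is guaranteed prime
        xs = [x for x in rest if x % p != 0]   # drop the multiples of p
    return out

def era(n):
    # range(2, n) in the text is a *closed* interval, hence n + 1 here
    return sieve(list(range(2, n + 1)))

print(era(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```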

Finally, to show the power of anamorphisms coupled with lazy
evaluation, let’s use sieve to compute
a list of all primes. We begin by writing a list *l* of all natural numbers starting at
2, $$\begin{aligned}
h(x) &= \operatorname{\mathrm{cons}}(x, x+1) \\
l &= (\operatorname{\mathrm{ana}}h)(2).
\end{aligned}$$ This list must be infinite because, since *h* never returns nil , the anamorphism recurses forever.
Unsurprisingly, the list of all primes is defined as primes = sieve (*l*). At first, this
may not seem very useful since any computer would take infinite time to
calculate primes . But, after some
thought, you will notice that the way sieve calculates its output is rather
special. After it produces an element of the list, it never touches that
element again! Thus, although it’s impossible to properly calculate
primes , we have that for any natural
*N*, the first *N* elements of primes are calculated in a finite amount of
time. So you can think of primes as an
iterator that generates era (*n*) in finite time.
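With Python generators standing in for lazy lists, the infinite version can be sketched too; `islice` plays the role of taking the first N elements in finite time (again, the translation is mine):

```python
from itertools import count, islice

def sieve(xs):
    """Lazily unfold the primes from a (possibly infinite) iterator."""
    while True:
        p = next(xs)
        yield p                                      # emit p; never revisit it
        xs = filter(lambda x, p=p: x % p != 0, xs)   # lazily drop multiples of p

primes = sieve(count(2))        # conceptually, the list of *all* primes
print(list(islice(primes, 8)))  # [2, 3, 5, 7, 11, 13, 17, 19]
```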

This last property is common to all anamorphisms. Although their output can be a possibly infinite data structure (even if the input is finite), it is produced in an extremely structured manner. When a value is computed, we go to the next one and the previous is never touched again. Therefore, we can still use an anamorphism to calculate finite data so long as we properly say when we wish to stop. In fact, if we take a look at programs such as games or operating systems, this is exactly the behaviour we want from them. Imagine if an OS terminated after a finite amount of steps… No, what we want is for the OS to run indefinitely, producing well-defined actions in finite time.

By now, we’ve seen two kinds of recursion schemes. Anamorphisms start with a seed value and end with a data structure constructed from it, while catamorphisms start with a data structure and collapse it to end with a result. Both of these are powerful tools but their type signatures are too constrained. They must explicitly return or receive a data structure. What if we want to write a recursive function from a primitive value to another primitive value? Our next (and last for today) recursion scheme addresses exactly this.

Our strategy will be to use an inductive type as our middleman. The
*hylomorphism* takes a value, applies an anamorphism to turn it
into a data structure and then applies a catamorphism to collapse it
into another value.^{11} This means that we don’t need to go
through yet another category to define it. No, given a functor *F*, the type signature for hylo is hylo : (*F**B*→*B*) × (*A*→*F**A*) → (*A*→*B*)
and the definition is simply^{12} hylo *f* *g* = cata *f* ∘ ana *g*.

The etymology for hylomorphism is a little harder to motivate but
let’s try anyway. The prefix *hylo-* comes from the Greek ‘ὕλη’ and means “wood”
or “matter”. And as we’ve previously seen, the term morphism comes
from the Greek word for form. The name hylomorphism is a pun on an
Aristotelian theory of the same name which says that the being is
composed from matter and form. Since I never really understood
Aristotle’s writings and was unable to find another word starting with
*hylo-* outside the context of philosophy, I will just try to
improvise an intuition. Let’s think of algebras and coalgebras as ways
to give form or create matter and a hylo combines them into a single being. It’s
kinda lame but was the best I could think of.

Although the first recursive function introduced in this post was the
factorial (way back in the motivation), we
haven’t rewritten it as a recursion scheme yet. It’s time to fix that.
If you open the definition of fat , you
will see that fat (*n*) stands
for the product of the first *n*
natural numbers, $$
\operatorname{\mathrm{fat}}(n) = \prod_{k=1}^n k.$$ Thus, an
algorithm to compute the factorial is to first construct a decreasing
list of integers from *n* to
1 and then collapse it by multiplying
all its elements. We’ve already constructed both these functions but
let’s rewrite their definitions for the sake of completeness. We start
with the coalgebra and algebra, $$\begin{aligned}
g(n) &= \begin{cases}
\operatorname{\mathrm{nil}},& n < 1 \\
\operatorname{\mathrm{cons}}(n, n-1),& \text{otherwise},
\end{cases} \\
f(\operatorname{\mathrm{nil}}) &= 1, \\
f(\operatorname{\mathrm{cons}}(x,y)) &= x \cdot y,
\end{aligned}$$ and finally define the factorial as fat = hylo *f* *g*.
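Fused into a single recursive function following hylo f g = f ∘ F(hylo f g) ∘ g, a Python sketch might read (tuple encoding and names assumed by me):

```python
def hylo(f, g, seed):
    """Unfold one layer with g, recurse, then collapse it with f."""
    layer = g(seed)                             # g: one construction step
    if layer[0] == "cons":
        _, x, nxt = layer
        layer = ("cons", x, hylo(f, g, nxt))    # F(hylo f g)
    return f(layer)                             # f: one collapse step

def g(n):
    return ("nil",) if n < 1 else ("cons", n, n - 1)

def f(layer):
    if layer[0] == "nil":
        return 1
    return layer[1] * layer[2]  # multiply the head by the collapsed tail

def fat(n):
    return hylo(f, g, n)

print(fat(5))  # 120
```

Notice that no intermediate list is ever materialized: the call stack is the list.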

Time for a more complex example: let’s see how to use hylomorphisms
to better shape the procedure call tree and get faster algorithms.
Recall how we defined exp as a
catamorphism. For what we knew at the time, it was fine but, at its
*O*(*n*) complexity, it’s
just too slow. With a hylomorphism, we can do better than that. The
trick is noticing that for an even exponent *n*, the exponential satisfies
*e*^{n} = (*e*^{n/2})^{2},
which gives us a much better recursive relation. Thus, instead of
multiplying *e* *n* times, we can construct a much
smaller call tree and collapse it to get the desired result. We define a
type `CallTree`
as the fixed point of the functor *T**X* = leaf ℝ ∣ square *X* ∣ mult ℝ × *X*.
This type encapsulates the call tree. As base cases, *n* equal to zero or one means that we
store *e*^{n}
on a leaf node. If *n* is even, we construct a square node and pass *n*/2 to it, meaning that the value of
this node will be squared on the algebra. Finally, if *n* is odd, we store *e* and *n* − 1 on this node, which will be
multiplied when folding with the algebra. The coalgebra that constructs
this structure is $$
g(n) = \begin{cases}
\operatorname{\mathrm{leaf}}(1),& n=0 \\
\operatorname{\mathrm{leaf}}(e),& n=1 \\
\operatorname{\mathrm{square}}(\frac{n}{2}),& n \text{ is even}
\\
\operatorname{\mathrm{mult}}(e, n-1),& n \text{ is odd}.
\end{cases}
$$ To collapse the tree, we will use the algebra $$\begin{aligned}
f(\operatorname{\mathrm{leaf}}\,x) &= x \\
f(\operatorname{\mathrm{square}}(x)) &= x^2 \\
f(\operatorname{\mathrm{mult}}(x,y)) &= x \cdot y.
\end{aligned}$$ Finally, we define the exponential as the
hylomorphism exp = hylo *f* *g*.
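A sketch of this O(log n) exponential in Python, with the call tree's three constructors encoded as tagged tuples (the encoding is my own):

```python
import math

def hylo(f, g, seed):
    """Build one call-tree layer with g, recurse, then collapse it with f."""
    layer = g(seed)
    if layer[0] == "square":
        layer = ("square", hylo(f, g, layer[1]))
    elif layer[0] == "mult":
        layer = ("mult", layer[1], hylo(f, g, layer[2]))
    return f(layer)  # leaf layers fall through untouched

def g(n):
    if n == 0: return ("leaf", 1.0)
    if n == 1: return ("leaf", math.e)
    if n % 2 == 0: return ("square", n // 2)   # e^n = (e^(n/2))^2
    return ("mult", math.e, n - 1)             # e^n = e * e^(n-1)

def f(layer):
    tag = layer[0]
    if tag == "leaf": return layer[1]
    if tag == "square": return layer[1] ** 2
    return layer[1] * layer[2]                 # mult

def exp_(n):
    return hylo(f, g, n)

print(exp_(10))  # approximately e^10
```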

When we analyzed the traceback for the exponential as a catamorphism,
we noticed that it did a multiplication for each succ that appeared until we reached zero . Since a natural number *n* is the *n*-th successor of zero, this amounts
to *n* multiplications. Viewing
a natural number in such a unary representation has its theoretical
advantages but is too slow and cumbersome in practice. With our new
definition using a hylomorphism, we let the binary representation of
*n* aid us in designing a more
efficient call tree. Since the number of multiplications necessary is
proportional to log_{2}*n*, we get an
algorithm with complexity *O*(log*n*). This is much
better! If you’re not impressed, take a look at the call tree generated
for *n* = 100 and see how it
requires much less than 100 computations! Better yet, to compute *e*^{200}, this tree would
have to be augmented by only one node.

```
for tree={grow''=east,draw},
[square, for tree={draw}
[square [mult [e]
[square [square [square [mult [e]
[square [e]]]]]]]]]
```

Hylomorphisms really shine when an algorithm builds an intermediate data structure that is used but not returned. A special class of these are the divide-and-conquer algorithms, whose structure lends itself to elegant implementations as hylomorphisms. There are many examples, such as merge sort, quick sort and convex hull, that can be written as an anamorphism that divides the input in a structured manner followed by a catamorphism that conquers the output from it. Please try to write any of these as a hylo ! If you pick your favorite algorithm, there is a great chance it can be written in a clear manner as a hylomorphism.

To end this section (and this post), we will do a last derivation showing that hylomorphisms do not need to be defined in the way they were. No, they accept a recursive definition in terms of themselves as any good recursion scheme should.

Given a *F*-algebra *f* and a *F*-coalgebra *g*, their hylomorphism satisfies
hylo *f* *g* = *f* ∘ *F*(hylo*f* *g*) ∘ *g*.

Since the least and greatest fixed points of *F* are equal, we can assume that
in and out are inverses to each other. By pasting
together the diagrams defining the anamorphism and the catamorphism we
get

```
F X \ar[r, "F(\operatorname{ana} g)"] & F S \ar[d, bend right=15, "\operatorname{in}"'] \ar[r, "F(\operatorname{cata} f)"] & F Y \ar[d, "f"]\\
X \ar[u, "g"] \ar[r, dashed, "\operatorname{ana} g"'] & S \ar[u, bend right=15, "\operatorname{out}"'] \ar[r, dashed, "\operatorname{cata} f"'] & Y
```

Going from *X* to *Y* through the bottom path amounts
to our definition of hylomorphism. Since the diagram is commutative, it
must be equivalent to tracing the top path. This, together with the
functoriality of *F*, implies
$$\begin{aligned}
\operatorname{\mathrm{hylo}}f\,g &= f \circ
F(\operatorname{\mathrm{cata}}f) \circ F(\operatorname{\mathrm{ana}}g)
\circ g \\
&= f \circ F(\operatorname{\mathrm{cata}}f \circ
\operatorname{\mathrm{ana}}g) \circ g \\
&= f \circ F(\operatorname{\mathrm{hylo}}f\,g) \circ g.
\\
\end{aligned}$$ Thus we conclude the proof.

Although our definition of a hylo is good to reason about, this formula has two advantages over it. The first one is practical: because there is less boilerplate happening, we only make half as many function calls. As you can see in the proof’s second line, our definition requires hylo to first recurse through ana , stacking a pile of calls to in , only to immediately thereafter recurse through cata , canceling each in with an out . This does not change the algorithm’s complexity but still hinders its efficiency. In very large programs that run for weeks, even the complexity constant hurts. There is a practical difference between a program that runs in a week and a program that runs in two.

The second reason is more theoretical: this last formula has no reference to fixed points. There is no explicit mention of any data structure whatsoever. All we have to do is give hylo one rule saying how to construct each step and one rule saying how to destruct each step, and the call stack becomes our intermediate data structure! This is much cleaner and spares us from thinking about least and greatest fixed points. In short, when thinking about hylomorphisms, use the definition as an anamorphism followed by a catamorphism, but when actually implementing one, use this formula instead.

This was a long journey but we finally got to the end. Throughout it we met our three heroes: the catamorphism, the anamorphism, and the hylomorphism, and I truly hope that the ideas they encapsulate will appear in your future programs.

Although these three are powerful tools, they’re not the only recursion schemes available. Oh no, there is an entire zoo of them and, as you may expect, every single one has an arcane ancient Greek prefix. Each recursion scheme encapsulates a common pattern of recursion and comes in two flavours: construction and destruction of an inductive type. The ones we’ve seen today are the schemes for structural recursion, but if instead we wanted primitive recursion, we would use a para- or apomorphism. It is even possible to combine different schemes to end up with monsters such as the infamous zygohistomorphic prepromorphisms. Moreover, as one can expect of anything related to category theory, all of them are special cases of a certain categorical construction.

Another topic I haven’t given the deserved attention is the fusion laws for recursion schemes. The recursive formulas we derived for each scheme may be seen as certain algebraic equations that they must obey. Besides these, there are some other laws related to composition of schemes or other contexts. These give us some of the guarantees of structured programming we talked about earlier. If a compiler knows that the code only uses certain recursion schemes, it can use the fusion laws to replace your code with something semantically equivalent but much more efficient. You get the type-safe, organized code and the machine gets the faster, optimized one. Win-win.

Well, that’s all for today. Good recursion for y’all!

1. This sum example is simple to code with a loop or linear recursion. Recursion schemes really shine when manipulating more complex data structures, such as trees with varying arity.↩︎

2. This is also called `Option` in some languages and is especially useful to safely model partial functions and computations that may fail.↩︎

3. Choosing the letter *N* was no accident.↩︎

4. Also known as `accumulate` or `foldr` in some languages. The ‘r’ means that this function folds a list from the right.↩︎

5. This is not entirely true. Questions such as non-termination or laziness/strictness may break this structure in some languages, see this link for a discussion concerning Haskell. Nevertheless, types are similar enough to a category to be worth thinking of them as forming one.↩︎

6. Take a look at chapter 1 of Emily Riehl’s book if you want to learn this (and much more) properly. Trust me, it’s a great book for the mathematically inclined.↩︎

7. The technical term is *least fixed point*.↩︎

8. Fortunately, as functional languages have no side-effects, applying a catamorphism doesn’t make everyone start talking in a different language.↩︎

9. Also called `unfold` in the context of lists.↩︎

10. This example is adapted from an anamorphism in the lecture notes Programming with Categories by Brendan Fong, Bartosz Milewski and David I. Spivak.↩︎

11. As it stands, hylomorphisms are the theoretical analogue of me playing any construction game as a kid.↩︎

12. To write that, we must assume that the least and the greatest fixed point of *F* coincide. Although this assumption may sound strange, it always holds for languages such as Haskell.↩︎

Nevertheless, this all changed earlier this year. I was taking a
shower^{1}, not even thinking about math when I
was suddenly struck by a bliss of clarity: *the average minimizes the
squared error*—I said to myself. This seemingly simple statement is
present (normally as an exercise) in practically any introductory
probability book, but I had never realized that it implies so much! From
this, we can construct the average not by a smart formula but
geometrically using a variational principle.

Let’s look at it from an information point of view: suppose you are
working with a random variable *X* and you don’t have room to store
all its information. In fact, for the sake of this example, you can only
store *a single number* to represent it. Which number should you
choose? You probably already know it is 𝔼[*X*]… But why? Well… Because it is
in fact the best possible approximation. This post will be a digression
on this theme:

The closest constant to a random variable is its expected value.

Of course, there are some things lacking in the theorem above. How do you measure the distance between random variables? Moreover, constants are numbers and random variables are functions. They have different types! Then, what does it mean for them to be close together? The usual probabilistic view doesn’t emphasize this, but we can interpret random variables geometrically as points in space and, then, measuring their distance becomes as simple as taking the length of the line segment connecting them. Indeed, if we know the random variables, we can even measure this length through nothing more, nothing less than the Pythagorean Theorem!
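Before going geometric, here is a quick numeric sanity check of the claim in Python (the distribution is an arbitrary example of mine): the probability-weighted squared error E[(X − c)²] is smallest exactly at c = E[X].

```python
probs = [0.2, 0.5, 0.3]          # a small arbitrary distribution
vals  = [1.0, 2.0, 4.0]          # the values X takes on each outcome

mean = sum(p * x for p, x in zip(probs, vals))

def mse(c):
    """Expected squared error of approximating X by the constant c."""
    return sum(p * (x - c) ** 2 for p, x in zip(probs, vals))

# nudging c away from the mean can only increase the error
assert all(mse(mean) <= mse(mean + d) for d in (-1.0, -0.1, 0.1, 1.0))
print(mean)  # approximately 2.4
```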

Let’s begin by laying the groundwork. Suppose that we’re considering
an event with a finite amount (let’s say *N*) of possible outcomes *Ω* = {*ω*_{1}, …, *ω*_{N}},
each occurring with a probability *p*_{i}. You can
think of it as throwing a die, pulling a slot machine’s lever or any
other situation where uncertain stuff may happen.

In this framework, we can think of a *random variable* as a
function that for each possible outcome assigns a real number^{2}, *X*: *Ω* → ℝ. As an example, a
possible random variable is *how much* cash each outcome of a
slot machine gives you. Now comes the **most important
part**: these are just real-valued functions, thus summing them
or scaling by a number always amounts to another random variable:

$$ \begin{aligned} (X + Y)(\omega) &= X(\omega) + Y(\omega) \\ (kX)(\omega) &= k \cdot X(\omega) \end{aligned} $$

This means that our random variables form a *Vector Space*!
For us, the beauty of this is that we can use the tools of good ol’
Euclidean Geometry to understand how they work. And I don’t know for
you, but I always feel more at home when working in the land of Linear
Algebra.

A nice consequence of having only a finite amount of possible
outcomes is that the space of random variables is actually
finite-dimensional! The *indicator functions* of the outcomes
form a basis to this space:

$$ \mathbb{I}_i(\omega_j) = \begin{cases} 1,& i = j \\ 0,& \text{otherwise}. \end{cases} $$

The coefficients of a random variable in this basis are also pretty
straightforward. They are just the values it returns for each outcome.
That is, if *X*: *Ω* → ℝ
returns the value *x*_{i} when we
sample the outcome *ω*_{i}, then it is
decomposable as

$$ X = \sum_{j = 1}^N x_j \mathbb{I}_j. $$

In two dimensions, we can also visualize this as a figure:

Because we are interested in a geometrical definition for the mean,
let’s consider a special example: random variables that always return
the same value no matter what outcome we sample. These are called
*constant random variables*, and as you can imagine from the
introduction, they are of utmost importance for what we are doing in this
post. So, if *C*: *Ω* → ℝ
always returns *c*, it can be
decomposed as

$$ C = \sum_{j = 1}^N c \mathbb{I}_j = \underbrace{c}_{\text{constant}} \cdot \underbrace{\left(\sum_{j=1}^N \mathbb{I}_j\right)}_{\text{direction}}. $$

Or, as a figure:

From this we see that all constants lie on the same line (one-dimensional space), given by the diagonal of all indicators. Remember that we want a method to find the closest constant to a random variable. Intuitively, we can think of this procedure as projecting the random variable into the line of constants. So, let’s proceed by considering how to project vectors into subspaces.

Possibly you noticed that until now, we never used the probabilities. The random variables represent values attached to some non-deterministic outcome; however, we haven’t used any notion of which outcomes are more probable nor how they relate to each other.

Another thing you may have noticed is that the previous section was
all function algebra, without any real geometry happening. There were no
angles, distances nor anything like that. Very well, this is the time to
fix things up by killing two birds with one stone. Our solution will use
the *probabilities* of the outcomes to define an *inner
product* on the space of random variables.

Now it’s modeling time! How can we embed the probabilistic structure into an inner product? To have an inner product that somehow reflects the probabilities, we will ask that it satisfies some coherence conditions.

We want the inner product of two random variables *X* and *Y* to depend on the probability
*p*_{i} only if both
of them are non-zero for the outcome *ω*_{i}. This
restriction represents that the information for one possible outcome is
only important for that outcome. Some high-level consequences of this
restriction are:

- Random variables with disjoint supports (that is, whenever one of them is non-zero, the other is zero) are **orthogonal**.
- The norm of a random variable only depends on the outcomes for which it is **non-zero**.

Now more concretely: how does this shape the inner product? It is
completely determined by how it acts on a basis, so let’s check these
properties for the *indicators*. First, the norm of 𝕀_{j} can only depend on the
probability *p*_{j}. Also, since
they are supported on different outcomes, this definition forces the
𝕀_{j} to form an
orthogonal basis!

$$ \left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} f(p_i), & i = j \\ 0,& \text{otherwise}. \end{cases} $$

where *f*: ℝ → ℝ is a yet-undetermined
function. To totally define this inner product, we must
enforce another property: we require deterministic objects to be unaware
of the probabilistic structure. Since the only way for a random variable to
be deterministic is to be constant, this translates to: *if C is a constant random variable,
then its norm ∥C∥_{2}
doesn’t depend on any p_{i}.*
Moreover, for the sake of consistency, let’s also require the norm of a
constant random variable to be precisely the value that it always
returns. In math symbols: if *C*(*ω*) = *c* for all *ω*, then

$$ \begin{aligned} \left\lVert C\right\rVert_2 &= \sqrt{\left\langle C,C\right\rangle} \\ &= \sqrt{\left\langle c \textstyle\sum_i \mathbb{I}_i, c \textstyle\sum_j \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_{i,j}\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_j f(p_j)} \end{aligned} $$

Because the *p*_{j} are the
coefficients of a probability distribution, they must sum to one. Thus,
a good choice for *f* that
satisfies our desired properties is to set it equal to the identity
function. This way we get

$$ \left\lVert C\right\rVert_2 = c \sqrt{\textstyle\sum_j f(p_j)} = c \sqrt{\textstyle\sum_j p_j} = c \cdot 1 = c $$

Now we finally have an inner product that coherently represents the probabilistic structure of our random variables! On the indicators basis, it is defined as

$$ \left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} p_i, & i = j \\ 0,& \text{otherwise}. \end{cases} $$

While for general *X* and
*Y*, we can use linearity to
discover that this amounts to

⟨*X*,*Y*⟩ = ∑_{j}*p*_{j}*x*_{j}*y*_{j}.
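To make this formula concrete, here is a small runnable sketch, with a random variable represented by its list of values on the indicator basis; the function and variable names are my own, not from the text:

```haskell
-- Random variables on a finite outcome space, represented by their
-- coefficients on the indicator basis (one value per outcome).
inner :: [Double] -> [Double] -> [Double] -> Double
inner ps xs ys = sum (zipWith3 (\p x y -> p * x * y) ps xs ys)

main :: IO ()
main = do
  let ps = [0.5, 0.5]   -- a fair coin
      x  = [1, -1]      -- win 1 on heads, lose 1 on tails
  print (inner ps x x)  -- the squared norm of x
```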

Recall that what an inner product essentially defines are notions of
distances and angles between vectors. Particularly, the distance is
given by $\mathrm{dist}({X},{Y}) = \left\lVert X-Y\right\rVert_2 = \sqrt{\left\langle X-Y, X-Y\right\rangle}$. Through it, we finally
have the necessary tools to rigorously describe the
**average** using a variational principle, as we talked
about in the introduction.

Let *X* be a random variable.
We call its *average* 𝔼[*X*] the value of the constant
random variable that is closest to it.

$$ \begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \mathrm{dist}({X},{ C}) \\ \textrm{s.t.} & C(\omega) = c, \forall \omega \end{array} $$

Let’s distill this definition a bit. For us, “closest” means “vector
that minimizes the distance”. Remembering that all constant random
variables lie on the same line, we get from Geometry/Linear Algebra that
𝔼[*X*] is the result of the
*orthogonal projection* of *X* onto ∑_{j}𝕀_{j}
(which is a unit vector). What we really want to calculate can be
summarized as a picture:

In order to represent this as a closed formula, we must remember that
squaring a non-negative function does not change the point that
minimizes it (because squaring is increasing on the non-negative reals).
Our previous definition can then be more concretely written as this
*least squares problem*:

$$ \begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \frac{1}{2}\left\lVert X-C\right\rVert_2^2 \\ \textrm{s.t.} & C = c \sum_j \mathbb{I}_j. \end{array} $$

Least squares problems notoriously have a closed-form solution, but let’s
derive it in this particular case. First we turn it into an
unconstrained problem by substituting *C* in the objective function, which
makes it look like $\frac{1}{2}\sum_i p_i (x_i
- c)^2$. Since this is convex, all we need to find the minimizer
is to apply the good old method of differentiating **with respect
to c** and equating to zero.

$$ \begin{aligned} 0 &= -\frac{d}{dc} \left(\frac{1}{2}\textstyle\sum_i p_i (x_i - c)^2\right) \\ &= \frac{1}{2}\textstyle\sum_i p_i \cdot 2 (x_i - c) \\ &= \textstyle\sum_i p_i x_i - \textstyle\sum_i p_i c \\ &= \left(\textstyle\sum_i p_i x_i\right) - c \end{aligned} $$

By moving the scalar *c* to
the equation’s left-hand side, we finally find a formula for the
expected value. And, of course, it meets our expectation!^{3}

$$ \boxed{\mathbb{E}[X] = \sum_{i=1}^N p_i x_i} $$
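As a sanity check of the whole derivation, the sketch below (names are my own) computes 𝔼[*X*] by the boxed formula and verifies numerically that no other candidate constant is closer to *X* in the squared distance:

```haskell
-- Expected value via the formula we just derived.
expectation :: [Double] -> [Double] -> Double
expectation ps xs = sum (zipWith (*) ps xs)

-- Squared distance from X to the constant random variable with value c.
sqDist :: [Double] -> [Double] -> Double -> Double
sqDist ps xs c = sum (zipWith (\p x -> p * (x - c) ^ (2 :: Int)) ps xs)

main :: IO ()
main = do
  let ps = [0.2, 0.3, 0.5]
      xs = [1, 2, 10]
      mu = expectation ps xs
  print mu  -- close to 5.8
  -- no candidate constant on a grid gets closer (up to rounding)
  print (all (\c -> sqDist ps xs mu <= sqDist ps xs c + 1e-9) [-10, -9.9 .. 10])
```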

We just defined the average of *X* through the closest constant to
it. A question that naturally arises is how well this constant represents *X*. We can answer this by looking at
the projection’s *error*, that is, the distance between *X* and 𝔼[*X*]. (In what follows, we sometimes
overload the notation 𝔼[*X*] to
also mean the constant random variable with value 𝔼[*X*]. What we are talking about
should be clear from context.)

$$ \begin{aligned} \text{error} &= X - \mathbb{E}[X], \\ \sigma[X] &= \left\lVert\text{error}\right\rVert_2 = \mathrm{dist}({X},{ \mathbb{E}[X]}) \\ &= \sqrt{\sum_{i=1}^N p_i (x_i - \mathbb{E}[X])^2}. \end{aligned} $$

You probably already know this formula by the name of
**standard deviation**! Although in probability textbooks
it appears as the concise but opaque formula $\sigma[X] = \sqrt{\mathbb{E}[(X -
\mathbb{E}[X])^2]}$, here it naturally appears as the size of the
least squares’ error from approximating *X* by a constant.
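As a runnable illustration (the names are my own), here is the standard deviation of a fair die computed exactly as the norm of this least-squares error:

```haskell
-- Standard deviation as the norm of the error X - E[X].
stdDev :: [Double] -> [Double] -> Double
stdDev ps xs = sqrt (sum (zipWith (\p x -> p * (x - mu) ^ (2 :: Int)) ps xs))
  where mu = sum (zipWith (*) ps xs)  -- the mean, i.e. the projection

main :: IO ()
main = print (stdDev (replicate 6 (1/6)) [1 .. 6])  -- a fair die: sqrt(35/12)
```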

In real-world probability, one of the most widely used kinds of
random variables is certainly the *Gaussian random variable*.
Among its fundamental properties is that its distribution is
uniquely defined by its mean and standard deviation^{4}.
This part is only my opinion and not mathematics, but I like to think
that what makes Gaussians so special is that they can be recovered solely by
knowing their approximation by a constant and how far off this
approximation is.

Until now, we have only been looking at distances and orthogonal
projections. Nevertheless, the inner product also tells us about the
angle between two vectors! From Euclidean Geometry, we know that the
angle *θ* between two vectors
satisfies

⟨*A*,*B*⟩ = ∥*A*∥_{2}∥*B*∥_{2}cos *θ*.

An interesting use of this angle is in calculating the
**correlation** between random variables. To find it, we
first calculate the errors of approximating *X* and *Y* by their means. Then, the
correlation is defined exactly as the cosine of the *angle*
between these errors. That is, if we let *θ* equal the angle between *X* − 𝔼[*X*] and *Y* − 𝔼[*Y*], then

$$ \begin{aligned} \mathrm{corr}[X, Y] &= \cos(\theta) \\ &= \frac{\left\langle X - \mathbb{E}[X], Y - \mathbb{E}[Y]\right\rangle}{\sigma[X]\sigma[Y]} \end{aligned} $$

What does the correlation mean, you may ask? Well, in usual
statistical parlance, the farther the correlation is from zero, the
closer the random variables are to being linear functions of one
another, i.e. *Y* = *aX* + *b*.
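A runnable illustration (names are my own): centering two variables and taking the cosine of the angle between the errors recovers a correlation of exactly 1 for a linear relation:

```haskell
-- Correlation as the cosine of the angle between the centered variables.
correlation :: [Double] -> [Double] -> [Double] -> Double
correlation ps xs ys = dot dx dy / (sqrt (dot dx dx) * sqrt (dot dy dy))
  where
    dot as bs = sum (zipWith3 (\p a b -> p * a * b) ps as bs)  -- the inner product
    mean as   = sum (zipWith (*) ps as)
    dx        = map (subtract (mean xs)) xs  -- error X - E[X]
    dy        = map (subtract (mean ys)) ys  -- error Y - E[Y]

main :: IO ()
main = do
  let ps = [0.25, 0.25, 0.5]
      xs = [1, 2, 4]
      ys = map (\x -> 2 * x + 3) xs  -- an exact linear function of xs
  print (correlation ps xs ys)       -- 1 up to rounding
```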

Up until this point, we only considered distances in the “classical” or Euclidean sense. But sometimes that is not the most appropriate way to measure how far apart things are. For example, if you are driving in a city, there is no way to cut diagonally between blocks. The best way to describe the distance between two points is through the total horizontal and vertical increments one has to traverse from one point to the other. So let’s talk a bit about these more “exotic” distances.

Coming from our example above, we define the *L*^{1}-norm^{5} of
*X* analogously to the Euclidean
norm as

$$ \left\lVert X\right\rVert_1 = \sum_{i = 1}^N p_i |x_i|. $$

This is just like the Euclidean norm we were talking about but
exchanging the squares for absolute values. Instead of measuring the
length of the line segment between points, it measures the length we
must traverse if we could only walk through a grid parallel to the
indicator functions 𝕀_{j}. This distance is
widely used in the literature for its robustness and sparsity-inducing
properties. Also, while the Euclidean distance is rotation invariant,
the *L*^{1}-norm clearly
has some preference towards the outcomes’ indicators. So, our
variational principle using this distance is

$$ \begin{array}{rl} \min\limits_{c} & \sum_{i=1}^N p_i |x_i - c| \end{array} $$

This optimization problem can be rewritten as a *linear
program* and has a (possibly non-unique) solution that is also used
all around Statistics: the **median**. This is a constant
*μ* with the nice property $\mathrm{P}(X \le \mu) \ge \frac{1}{2}$ and
$\mathrm{P}(X \ge \mu) \ge
\frac{1}{2}$.
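A small numeric check (names are my own) that the median minimizes this weighted *L*¹ distance, brute-forced over a grid of candidate constants for a fair die:

```haskell
-- Weighted L1 distance from X to the constant c.
l1Dist :: [Double] -> [Double] -> Double -> Double
l1Dist ps xs c = sum (zipWith (\p x -> p * abs (x - c)) ps xs)

main :: IO ()
main = do
  let ps = replicate 6 (1/6)
      xs = [1 .. 6]  -- a fair die: any point of [3, 4] is a median
  -- the median 3.5 is at least as close as every grid candidate
  print (all (\c -> l1Dist ps xs 3.5 <= l1Dist ps xs c + 1e-9) [0, 0.1 .. 7])
```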

For any measure of distance between random variables,
there is some associated constant that minimizes it and that, in general, is
already an important object in Statistics. As an example, try to
discover the closest constant to *X* under the ∞-norm: ∥*X*−*Y*∥_{∞} = max_{ω}*P*(*ω*)|*X*(*ω*)−*Y*(*ω*)|.
In fact, we don’t even have to use a norm! Any meaningful notion of
distance can be used to project (non-linearly) onto the space of
constants.

As a last example before we finish, let’s consider a distance that
does not come from a norm: dist(*X*, *Y*) = P(*X* ≠ *Y*).
This is an interesting distance, widely used in signal processing,
because when *Y* = 0 it
measures the sparsity of *X* as
a vector. Can you guess what constant minimizes it? Try to prove that it
is the *mode*^{6} of *X*!

When I started thinking of these ideas, I talked about them to Hugo Nobrega and João Paixão. This post only came to life because of how excited they got with it. So much so that they convinced me it could be interesting to write about it.

I also want to thank Daniel Hashimoto and Ivani Ivanova for reading the first version of this post and helping me with their “bug reports”.

I don’t know why, but my best ideas always seem to come while I’m sleeping or in the shower.↩︎

If you know some measure theory, you know that this definition is oversimplified. It is missing the *σ*-algebra, measurability and other complicated stuff. But, well, the focus of this post is on intuition, thus I’m ok with sacrificing generality and some rigour for the sake of simplicity. Nevertheless, we can always think that there is an implicit discrete *σ*-algebra everywhere.↩︎

Ba dum tss.↩︎

Or variance if you prefer it squared.↩︎

Also known by the much more stylish names of Manhattan or taxicab’s norm.↩︎

The *mode* is the value returned with the greatest probability.↩︎

I offered them what I consider the perfect exhibition for this:
*Let’s make a solver for a Calculus exam!*

Calculus is a subject that everybody learns to respect (or fear) in their college years. Thus, at first sight this may seem too monumental a task for a mere exposition. But what if I told you that if we restrict ourselves to derivatives, it takes about a hundred lines of code? A lot of people are not used to thinking of Calculus this way, but computing derivatives is actually a pretty straightforward algorithm.

One thing that one of those friends, who is a Professor in the
Department of Computer Science, said really resonated with me: “People
would struggle much less with math if they learned in school how to
write syntax trees.”^{1}

I really liked this phrase and would add even more: learning about syntax trees (and their siblings, s-expressions) and recursion eased my way not only with math but with learning grammar as well. How I wish that math and language classes in school worked with concepts as uniform as these. Well, enough rambling. Time to do some programming!

Before delving into the depths of first-year undergraduate math,
let’s take a step back and start with something simpler: *rational
functions*.

`module Calculus.Fraction where`

A rational function is formed of sums, products, and divisions of
numbers and an indeterminate symbol, traditionally denoted by *x*. An example is something like

$$ \frac{32x^4 + \frac{5}{4}x^3 - x + 21}{\frac{5x^{87} - 1}{23x} + 41 x^{76}}.$$

Let’s construct the rational functions over some field of numbers `a`. It should have *x*, numbers (called *constants*), and arithmetic operations between them.

```
data Fraction a = X
                | Const a
                | (Fraction a) :+: (Fraction a)
                | (Fraction a) :*: (Fraction a)
                | (Fraction a) :/: (Fraction a)
                deriving (Show, Eq)
```

I chose to give it the name `Fraction` because rational functions are represented by fractions of polynomials. We make it a parameterized type because `a` could be any numeric field, just like in math we use the notations ℚ(*x*), ℂ(*x*), ℤ_{17}(*x*) to denote the rational functions over different fields.

Since we are using operator constructors, let’s give them the same associativity and fixity as the built-in operators.

```
infixl 6 :+: -- Left associative
infixl 7 :*:, :/: -- Left associative with higher precedence than :+:
```

For now our constructors are only formal, they just create syntax trees:

```
ghci> Const 2 :+: Const 2 :+: X
(Const 2 :+: Const 2) :+: X
it :: Num a => Fraction a
```

We can teach it how to simplify these expressions, but since the focus here is on derivatives, we will postpone this to a later section. Let’s say that right now our student will just solve the problems and return the exam answers in long form without simplifying anything.

The next thing is thus to teach it how to evaluate an expression at a value. The nice part is that, in terms of implementation, that’s equivalent to writing an interpreter from the Fractions to the base field.

```
eval :: Fractional a => Fraction a -> a -> a
eval X         c = c
eval (Const a) _ = a
eval (f :+: g) c = eval f c + eval g c
eval (f :*: g) c = eval f c * eval g c
eval (f :/: g) c = eval f c / eval g c
```

This is it. Our evaluator traverses the expression tree by turning each `X` leaf into the value `c`, keeping constants as themselves, and collapsing the nodes according to the operation they represent. As an example:

```
ghci> p = X :*: X :+: (Const 2 :*: X) :+: Const 1
p :: Num a => Fraction a
ghci> eval p 2
9.0
it :: Fractional a => a
```

One nicety about languages like Haskell is that they are not only good for writing DSLs, but they are also good for writing *embedded DSLs*. That is, something like our symbolic Fractions can look like just another ordinary part of the language.

Wouldn’t it be nice to just write `X^2 + 2*X + 1` instead of the expression we evaluated above?

Well, we first need to teach our program how to use the built-in numeric constants and arithmetic operations. We achieve this through the typeclasses `Num` and `Fractional`. This is kind of Haskell’s way of saying our type forms a Ring and a Field.

```
instance Num a => Num (Fraction a) where
  -- For the operations we just use the constructors
  (+) = (:+:)
  (*) = (:*:)
  -- This serves to embed integer constants in our Ring.
  -- Good for us that we already have a constructor for that.
  fromInteger n = Const (fromInteger n)
  -- This one is how to do `p -> -p`.
  -- We didn't define subtraction, so let's just multiply by -1.
  negate p = Const (-1) :*: p
  -- These ones below are kinda the problem of `Num`...
  -- As much as I don't like runtime errors,
  -- for this exposition I think the best is
  -- to just throw an error if the user tries to use them.
  abs    = error "Absolute value of a Fraction is undefined"
  signum = error "Sign of a Fraction is undefined"
```

This makes our type into a Ring and we can now use constants, `+` and `*` with it. The code to make it into a Field is equally straightforward.

```
instance Fractional a => Fractional (Fraction a) where
  (/)            = (:/:)
  fromRational r = Const (fromRational r)
```

Let’s see how it goes

```
ghci> (X^2 + 2*X + 1) / (X^3 - 0.5)
(((X :*: X) :+: (Const 2.0 :*: X)) :+: Const 1.0) :/: (((X :*: X) :*: X) :+: (Const (-1.0) :*: Const 0.5))
it :: Fractional a => Fraction a
```

What we wrote is definitely much cleaner than the internal representation. But there is still one more nicety: Doing this also gave us the ability to compose expressions! Recall the type of our evaluator function:

`eval :: Fractional field => Fraction field -> field -> field`

But we just implemented a `Fractional (Fraction a)` instance! Thus, as long as we keep our Fractions polymorphic, we can evaluate an expression at another expression.

```
ghci> eval (X^2 + 3) (X + 1)
((X :+: Const 1.0) :*: (X :+: Const 1.0)) :+: Const 3.0
it :: Fractional a => Fraction a
```

Alright, alright. Time to finally teach some calculus. Remember all the lectures, all the homework… Well, in the end, what we need to differentiate a rational function are only five simple equations: 3 tree recursive rules and 2 base cases.

```
diff :: Fractional a => Fraction a -> Fraction a
diff X         = 1
diff (Const _) = 0
diff (f :+: g) = diff f + diff g
diff (f :*: g) = diff f * g + f * diff g
diff (f :/: g) = (diff f * g - f * diff g) / g^2
```
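These five rules can be exercised in isolation. Below is a condensed, self-contained sketch that re-declares the type and writes the rules with the raw constructors (so it doesn’t depend on the `Num` instance above); the `main` example is my own:

```haskell
-- Condensed, standalone restatement of the Fraction type and the
-- five differentiation rules, for experimenting in isolation.
data Fraction a = X
                | Const a
                | Fraction a :+: Fraction a
                | Fraction a :*: Fraction a
                | Fraction a :/: Fraction a
                deriving (Show, Eq)

infixl 6 :+:
infixl 7 :*:, :/:

diff :: Num a => Fraction a -> Fraction a
diff X         = Const 1
diff (Const _) = Const 0
diff (f :+: g) = diff f :+: diff g
diff (f :*: g) = (diff f :*: g) :+: (f :*: diff g)
diff (f :/: g) = ((diff f :*: g) :+: (Const (-1) :*: (f :*: diff g))) :/: (g :*: g)

main :: IO ()
main = print (diff (X :*: X :+: Const (3 :: Integer)))  -- product rule plus a vanishing constant
```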

Well, that’s it. Now that we’ve tackled the rational functions, let’s meet some old friends from Calculus again.

In calculus, besides rational functions, we also have sines, cosines, exponentials, logs and anything that can be formed combining those via composition or arithmetic operations. For example:

$$ \frac{1}{23}\log\left(\sin(3x^3) + \frac{e^{45x} - 21}{x^{0.49}\mathrm{asin}(-\frac{\pi}{x})}\right) + \cos(x^2)$$

This sort of object is called an Elementary
function in the math literature but here we will call it simply an
*expression*.

`module Calculus.Expression where`

Let’s create a type for our expressions then. It is pretty similar to the `Fraction` type from before with the addition that we can also apply some transcendental functions.

```
data Expr a = X
            | Const a
            | (Expr a) :+: (Expr a)
            | (Expr a) :*: (Expr a)
            | (Expr a) :/: (Expr a)
            | Apply Func (Expr a)
            deriving (Show, Eq)

data Func = Cos | Sin | Log | Exp | Asin | Acos | Atan
  deriving (Show, Eq)
```

The `Func` type is a simple enumeration of the most common functions that one may find running wild in a Calculus textbook. There are also other possibilities but they are generally composed from those basic building blocks.

Since the new constructor plays no role in arithmetic, we can define instances `Num (Expr a)` and `Fractional (Expr a)` that are identical to those we made before. But having these new functions also allows us to add a `Floating` instance to `Expr a`, which is sort of Haskell’s way of expressing things that act like real/complex numbers.

```
instance Floating a => Floating (Expr a) where
  -- Who doesn't like pi, right?
  pi = Const pi
  -- Those are easy, we just need to use our constructors
  exp  = Apply Exp
  log  = Apply Log
  sin  = Apply Sin
  cos  = Apply Cos
  asin = Apply Asin
  acos = Apply Acos
  atan = Apply Atan
  -- We can write hyperbolic functions through exponentials
  sinh x  = (exp x - exp (-x)) / 2
  cosh x  = (exp x + exp (-x)) / 2
  asinh x = log (x + sqrt (x^2 + 1))
  acosh x = log (x + sqrt (x^2 - 1))
  atanh x = (log (1 + x) - log (1 - x)) / 2
```

We already have our type and its instances. Now it is time to also consider the transcendental part of the expressions. The evaluator is almost the same as before, except for a new pattern:

```
eval :: Floating a => Expr a -> a -> a
eval X           c = c
eval (Const a)   _ = a
eval (f :+: g)   c = eval f c + eval g c
eval (f :*: g)   c = eval f c * eval g c
eval (f :/: g)   c = eval f c / eval g c
eval (Apply f e) c = let g = calculator f
                     in g (eval e c)
```

We also had to define a `calculator` helper that translates between our `Func` type and the actual functions. This is essentially the inverse of the floating instance we defined above, but I couldn’t think of a way to do that with less boilerplate without using some kind of metaprogramming.^{2}

```
calculator :: Floating a => Func -> (a -> a)
calculator Cos  = cos
calculator Sin  = sin
calculator Log  = log
calculator Exp  = exp
calculator Asin = asin
calculator Acos = acos
calculator Atan = atan
```

The derivative is pretty similar, with the difference that we implement the chain rule for the `Apply` constructor.

Let’s start by writing a cheatsheet of derivatives. This is the kind of thing you’re probably not allowed to carry into a Calculus exam, but let’s say that our program has it stored in its head (provided this makes any sense). Our cheatsheet will take a `Func` and turn it into the expression of its derivative.

```
cheatsheet :: Floating a => Func -> Expr a
cheatsheet Sin  = cos X
cheatsheet Cos  = negate (sin X)
cheatsheet Exp  = exp X
cheatsheet Log  = 1 / X
cheatsheet Asin = 1 / sqrt (1 - X^2)
cheatsheet Acos = -1 / sqrt (1 - X^2)
cheatsheet Atan = 1 / (1 + X^2)
```

Finally, the differentiator is exactly the same as before except for a new pattern that looks for the derivative on the cheatsheet and evaluates the chain rule using it.

```
diff :: Floating a => Expr a -> Expr a
diff X           = 1
diff (Const _)   = 0
diff (f :+: g)   = diff f + diff g
diff (f :*: g)   = diff f * g + f * diff g
diff (f :/: g)   = (diff f * g - f * diff g) / g^2
diff (Apply f e) = let f' = cheatsheet f
                   in eval f' e * diff e
```

This way we finish our Calculus student program. It can write any elementary function as normal Haskell code, evaluate them, and symbolically differentiate them. So what do you think?

Although we finished our differentiator, there are a couple of topics that I think are worth discussing because they are simple enough to achieve and will make our program a lot more polished or fun to play with.

Definitely the least elegant part of our program is the expression simplifier. It is as straightforward as the rest, consisting of recursively applying rewriting rules to an expression, but there are a lot of corner cases and possible rules to apply. Besides that, which of two equivalent expressions is the simple one can sometimes be up for debate.

We first write the full simplifier. It takes an expression and applies rewriting rules to it until the process converges, i.e. the rewriting does nothing. We use a typical tail-recursive loop for this.

```
simplify :: (Eq a, Floating a) => Expr a -> Expr a
simplify expr = let expr' = rewrite expr
                in if expr' == expr
                   then expr
                   else simplify expr'
```

From this function, we can define a new version of `diff` that simplifies its output after computing it.

`diffS = simplify . diff`

The bulk of the method is formed by a bunch of identities. You can
think of them as the many math rules that a student should remember in
order to simplify an expression while solving a problem. Since there is
really no right answer^{3} when we are comparing equals with
equals, any implementation will invariably be rather ad hoc. One thing
to remember, though, is that the rules should eventually converge. If you
use identities that may cancel each other, the simplification may never
terminate.

```
rewrite :: (Eq a, Floating a) => Expr a -> Expr a
-- Constants
rewrite (Const a :+: Const b) = Const (a + b)
rewrite (Const a :*: Const b) = Const (a * b)
rewrite (Const a :/: Const b) = Const (a / b)
rewrite (Apply func (Const a)) = Const (calculator func a)
-- Associativity
rewrite (f :+: (g :+: h)) = (rewrite f :+: rewrite g) :+: rewrite h
rewrite (f :*: (g :*: h)) = (rewrite f :*: rewrite g) :*: rewrite h
-- Identity for sum
rewrite (f :+: Const 0) = rewrite f
rewrite (Const 0 :+: f) = rewrite f
rewrite (f :+: Const a) = Const a :+: rewrite f
-- Identity for product
rewrite (f :*: Const 1) = rewrite f
rewrite (Const 1 :*: f) = rewrite f
rewrite (f :*: Const 0) = Const 0
rewrite (Const 0 :*: f) = Const 0
rewrite (f :*: Const a) = Const a :*: rewrite f
-- Identity for division
rewrite (Const 0 :/: f) = Const 0
rewrite (f :/: Const 1) = rewrite f
-- Inverses
rewrite (f :/: h)
  | f == h = Const 1
rewrite ((f :*: g) :/: h)
  | f == h = rewrite g
  | g == h = rewrite f
rewrite (f :+: (Const (-1) :*: g))
  | f == g = Const 0
-- Function inverses
rewrite (Apply Exp (Apply Log f)) = rewrite f
rewrite (Apply Log (Apply Exp f)) = rewrite f
rewrite (Apply Sin (Apply Asin f)) = rewrite f
rewrite (Apply Asin (Apply Sin f)) = rewrite f
rewrite (Apply Cos (Apply Acos f)) = rewrite f
rewrite (Apply Acos (Apply Cos f)) = rewrite f
rewrite (Apply Atan ((Apply Sin f) :/: (Apply Cos g)))
  | f == g = rewrite f
-- Recurse on constructors
rewrite (f :+: g) = rewrite f :+: rewrite g
rewrite (f :*: g) = rewrite f :*: rewrite g
rewrite (f :/: g) = rewrite f :/: rewrite g
rewrite (Apply f e) = Apply f (rewrite e)
-- Otherwise stop recursion and just return itself
rewrite f = f
```

I find it interesting to look at the size of this `rewrite` function and think about the representation choices we made along this post. There are many equivalent ways to write the same thing, forcing us to keep track of all those equivalence relations.

One of the niceties of working with a lazy language is how easy it is
to work with infinite data structures. In our context, we can take
advantage of that to write the *Taylor Series* of an
expression.

The Taylor series of *f* at a
point *c* is defined as the
infinite sum

$$ f(x) = \sum_{n = 0}^\infty \frac{f^{(n)}(c)}{n!} (x-c)^n.$$

Let’s first write a function that turns an expression and a point into an infinite list of monomials. We do that by generating a list of derivatives and factorials, which we assemble for each natural number.

```
taylor :: (Eq a, Floating a) => Expr a -> a -> [Expr a]
taylor f c = simplify <$> zipWith3 assemble [0..] derivatives factorials
  where
    assemble n f' nfat = Const (eval f' c / nfat) * (X - Const c)^n
    -- Infinite list of derivatives [f, f', f'', f'''...]
    derivatives = iterate diff f
    -- Infinite list of factorials [0!, 1!, 2!, 3!, 4!...]
    factorials  = fmap fromInteger factorials'
    factorials' = 1 : zipWith (*) factorials' [1..]
```

We can also write the partial sums, which keep only *N* terms of the Taylor expansion. These have the computational advantage of actually being evaluable.

```
approxTaylor :: (Eq a, Floating a) => Expr a -> a -> Int -> Expr a
approxTaylor f c n = (simplify . sum . take n) (taylor f c)
```

At last, a test to convince ourselves that it works.

```
ghci> g = approxTaylor (exp X) 0
g :: (Eq a, Floating a) => Int -> Expr a
ghci> g 10
((((((((Const 1.0 :+: X) :+: ((Const 0.5 :*: X) :*: X)) :+: (((Const 0.16666666666666666 :*: X) :*: X) :*: X)) :+: ((((Const 4.1666666666666664e-2 :*: X) :*: X
) :*: X) :*: X)) :+: (((((Const 8.333333333333333e-3 :*: X) :*: X) :*: X) :*: X) :*: X)) :+: ((((((Const 1.388888888888889e-3 :*: X) :*: X) :*: X) :*: X) :*: X
) :*: X)) :+: (((((((Const 1.984126984126984e-4 :*: X) :*: X) :*: X) :*: X) :*: X) :*: X) :*: X)) :+: ((((((((Const 2.48015873015873e-5 :*: X) :*: X) :*: X) :*
: X) :*: X) :*: X) :*: X) :*: X)) :+: (((((((((Const 2.7557319223985893e-6 :*: X) :*: X) :*: X) :*: X) :*: X) :*: X) :*: X) :*: X) :*: X)
it :: (Eq a, Floating a) => Expr a
ghci> eval (g 10) 1
2.7182815255731922
it :: (Floating a, Eq a) => a
```

This post only exists thanks to the great chats I had with João Paixão and Lucas Rufino. Besides listening to me talking endlessly about symbolic expressions and recursion, they also asked a lot of good questions and provided the insights that helped shape what became this post.

I also want to thank the people on reddit that noticed typos and gave suggestions for code improvement.

The phrase wasn’t exactly that. It had a better effect. But it has been almost a week and I have the memory of a goldfish. The intention is preserved though.↩︎

If you have any suggestions to make this code more elegant, feel free to contact me and we can edit it. :)↩︎

Countless times I only finished a proof because I used an identity on the non-intuitive side. For example, writing something like *f*(*x*) = *f*(*x*) ⋅ 1 = *f*(*x*) ⋅ (sin(*x*)^{2} + cos(*x*)^{2}). Those are always some great *A-ha!* moments.↩︎

The Fast Fourier Transform (FFT for short) uses a divide-and-conquer strategy
to allow us to calculate the discrete Fourier transform in *O*(*N* log *N*) (where
*N* is the length of the input).
It has a myriad of applications, especially in signal processing and data
analysis.

But today I am not explaining it. Oh no, that would require a much longer post. Today we will only use the FFT as an example of how to write a fully fledged algorithm using neither loops nor explicit recursion. Since it is more complicated than the previous post’s examples, this time we will need a real programming language, thus everything will be in Haskell. I did this implementation as an exercise in recursion schemes, so I’m aiming more towards clarity than efficiency here. Therefore, we will represent vectors using Haskell’s standard type for linked lists. But if you want to adapt it to some other type such as `Vector` or `Array`, I think the conversion should be pretty straightforward.

One way to look at the discrete Fourier transform is as an exchange
between orthonormal bases on ℂ^{N}. Given a vector *v* ∈ ℂ^{N} and a
basis *e*_{t},
we can write *v* as^{1} $$v =
\sum_{t=0}^{N-1} x_t e_t,$$ where (*x*_{0},…,*x*_{N − 1})
are the coefficients of *v* on
the basis *e*_{t}. If, for
example, *v* represents a
discrete temporal series, its coefficients would represent the amplitude
at each sampled instant. In a lot of cases, however, we are also
interested in the amplitudes for the frequencies of this vector. To
calculate those, we exchange to another basis *f*_{k} (related to
*e*_{t}) called
the *Fourier basis* and write *v* as $$ v
= \sum_{k=0}^{N-1} y_k f_k$$ where the new coefficients *y*_{k} are defined
as $$ y_k = \sum_{t=0}^{N-1} x_t \cdot
e^{-{2\pi i \over N}t k}.$$

We have our first formula, time to implement it in Haskell! Let’s begin by importing the libraries we will need.

```
import Data.Complex
import Data.List (foldr, unfoldr)
```

The functions `foldr` and `unfoldr` are, respectively, Haskell’s built-in catamorphism and anamorphism for lists.

The formula for the *y*’s receives a list of coefficients and returns another list of coefficients

`dft :: [Complex Double] -> [Complex Double]`

The view we will take in here is that the input represents a parameter and we will build a new list in terms of it. This leads us to the following anamorphism:

```
dft xs = unfoldr coalg 0 where
  dim = fromIntegral (length xs)
  chi k t = cis (-2 * pi * k * t / dim)
  coalg k = if k < dim
    then let cfs = fmap (chi k) [0 .. dim - 1]
             yk  = sum (zipWith (*) cfs xs)
         in Just (yk, k + 1)
    else Nothing
```

If you’ve never seen the function `cis` before, it is Haskell’s imaginary exponential *a* ↦ *e*^{ia}.
The function `dft` builds the `y`’s one step at a time. There are *N* coefficients
*y*_{k} and
each of them requires *O*(*N*) calculations, thus the
complexity is *O*(*N*^{2}). Although
this is not monstrous, it can be a bottleneck for real-time
computations. Fortunately we can do better than this.

The function `dft` implements the Fourier transform
exactly as it is defined. However, if you take a look at the
coefficients used, you will see that there’s a certain pattern to them.
A way to exploit this pattern is encoded in the Danielson-Lanczos
Lemma.

In even dimensions, the Fourier transform satisfies $$y_k = y^{(e)}_k + e^{-{2\pi i \over N} k} \cdot
y^{(o)}_k,$$ where *y*^{(e)} is the
Fourier transform of its even-indexed coefficients and *y*^{(o)} is the
Fourier transform of its odd-indexed coefficients.

$$ \begin{aligned} y_k &= \sum_{t=0}^{N-1} x_t \cdot e^{-{2\pi i \over N} t k} \\ &= \sum_{t=0}^{N/2-1} x_{2t} \cdot e^{-{2\pi i \over (N/2)} t k} + e^{-{2\pi i \over N} k} \sum_{t=0}^{N/2-1} x_{2t + 1} \cdot e^{-{2\pi i \over (N/2)} t k} \\ &= y^{(e)}_k + e^{-{2\pi i \over N} k} \cdot y^{(o)}_k. \end{aligned}$$

Well, that’s a lot of symbols… The important part for us is that we
can break the DFT into two smaller problems and then merge them back
together with a *O*(*N*)
procedure. A divide-and-conquer algorithm!

Let’s model this as a hylomorphism. For simplicity (and to avoid the boilerplate of fixed points) we will use the direct definition of a hylo.

`hylo f g = f . fmap (hylo f g) . g`

The FFT will take an input list, properly divide it into even and odds indices and then conquer the solution by merging the subproblems into the output list.

```
fft :: [Complex Double] -> [Complex Double]
fft = hylo conquer divide
```

Alright, time to implement the actual algorithm. Let’s begin with the `divide` step. Our algorithm must take a list of complex numbers and rewrite it as a binary tree whose leaves are sublists of odd dimension and whose nodes represent splitting an even-dimensional list. The data structure that represents this call tree is

```
data CallTree a x = Leaf [a] | Branch x x

instance Functor (CallTree a) where
  fmap _ (Leaf xs)      = Leaf xs
  fmap f (Branch xs ys) = Branch (f xs) (f ys)
```

The bulk of the `divide` method consists of splitting a list into even and odd components. We can do this in *O*(*n*) steps using a fold from a list to a pair of lists.

```
split :: [a] -> ([a], [a])
split = foldr f ([], []) where f a (v, w) = (a : w, v)
```
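Just to make the indexing concrete, here is `split` on a small list (restated locally so the snippet runs on its own): 0-based even positions go to the left list, odd positions to the right.

```haskell
-- Same split as above: alternately distribute elements into two lists.
split :: [a] -> ([a], [a])
split = foldr f ([], []) where f a (v, w) = (a : w, v)

main :: IO ()
main = print (split [0 .. 5 :: Int])  -- ([0,2,4],[1,3,5])
```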

Finally, `divide` represents one step of constructing the call tree. If the list’s length is even, we split it into smaller lists and store them in a `Branch` to later apply the Danielson-Lanczos Lemma. In case it is odd, there are no optimizations we can do, thus we just store the list’s DFT in a `Leaf`.

```
divide v = if even (length v)
           then uncurry Branch (split v)
           else Leaf (dft v)
```

This constructs the call tree. Now it’s time to deal with the `conquer` step. First we notice that, thanks to the
periodicity of the complex exponential (and the famous Euler formula),
$$ e^{{-2\pi i \over N } (k + {N \over 2})} =
-e^{{-2\pi i \over N } k }.$$ From this, we can reconstruct the
FFT from the smaller subproblems as $$\begin{aligned}
y_k &= y^{(e)}_k + e^{-{2\pi i \over N} k} \cdot
y^{(o)}_k, \\
y_{k+{N \over 2}} &= y^{(e)}_k - e^{-{2\pi i \over N} k} \cdot
y^{(o)}_k.
\end{aligned}$$

In Haskell, we can apply both the reconstruction formulas and then concatenate the results.

```
conquer (Leaf v)       = v
conquer (Branch ye yo) = zipWith3 f [0..] ye yo
                      ++ zipWith3 g [0..] ye yo
 where
  dim = fromIntegral (2 * length ye)
  f k e o = e + cis (-2 * pi * k / dim) * o
  g k e o = e - cis (-2 * pi * k / dim) * o
```

The main advantage of writing code like this is that it is extremely modular. We have many small almost-independent snippets of code that are combined through the magic of a `hylo` into an extremely powerful algorithm. As a bonus, if you want to test this code by yourself, you can also invert the fft to recover the original coefficients in *O*(*N* log *N*) through
```
ifft x = fmap ((/dim) . conjugate) . fft . fmap conjugate $ x
 where dim = fromIntegral (length x)
```
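To check everything end to end, here is a self-contained sketch gathering all the snippets from this section into one file and verifying that `ifft . fft` recovers the input up to floating-point error:

```haskell
import Data.Complex
import Data.List (unfoldr)

-- Plain O(N^2) DFT, used on the odd-length leaves.
dft :: [Complex Double] -> [Complex Double]
dft xs = unfoldr coalg 0
 where
  dim = fromIntegral (length xs)
  chi k t = cis (-2 * pi * k * t / dim)
  coalg k
    | k < dim   = Just (sum (zipWith (*) (fmap (chi k) [0 .. dim - 1]) xs), k + 1)
    | otherwise = Nothing

-- The call tree of the divide-and-conquer recursion.
data CallTree a x = Leaf [a] | Branch x x

instance Functor (CallTree a) where
  fmap _ (Leaf xs)      = Leaf xs
  fmap f (Branch xs ys) = Branch (f xs) (f ys)

-- Direct definition of a hylomorphism.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo f g = f . fmap (hylo f g) . g

split :: [a] -> ([a], [a])
split = foldr f ([], []) where f a (v, w) = (a : w, v)

divide :: [Complex Double] -> CallTree (Complex Double) [Complex Double]
divide v = if even (length v)
           then uncurry Branch (split v)
           else Leaf (dft v)

-- Merge subproblems via the Danielson-Lanczos Lemma.
conquer :: CallTree (Complex Double) [Complex Double] -> [Complex Double]
conquer (Leaf v)       = v
conquer (Branch ye yo) = zipWith3 f [0..] ye yo ++ zipWith3 g [0..] ye yo
 where
  dim = fromIntegral (2 * length ye)
  f k e o = e + cis (-2 * pi * k / dim) * o
  g k e o = e - cis (-2 * pi * k / dim) * o

fft, ifft :: [Complex Double] -> [Complex Double]
fft    = hylo conquer divide
ifft x = fmap ((/ dim) . conjugate) (fft (fmap conjugate x))
 where dim = fromIntegral (length x)

main :: IO ()
main = do
  let v  = [1, 2, 3, 4, 5, 6, 7, 8] :: [Complex Double]
      v' = ifft (fft v)
  -- The round trip should reproduce v up to rounding noise.
  print (maximum (zipWith (\a b -> magnitude (a - b)) v v') < 1e-9)
```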

Another interesting fact I’ve noticed is that, when working with `Double`s, the `fft` has far fewer rounding errors than the `dft`. This probably occurs because we perform fewer floating-point operations, not because of the hylomorphism. But I thought it worth noticing anyway.

Well, that’s all for today. Good morphisms to everyone!

In this context, life gets much easier if our vectors are zero-indexed.↩︎

Today we will redefine these symbolic derivatives using a different
approach: *automatic differentiation*. This new way to calculate
derivatives will only depend on the evaluation function for expressions.
This decouples differentiation from whatever representation we choose
for our expressions and, even more important, it is always nice to learn
different ways to build something!

I first heard of this idea while reading the documentation of the ad package and just had my mind blown. In here we will follow a simpler approach by constructing a simple AD implementation but for any serious business, i really recommend taking a look at that package. It is really awesome.

```
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE StandaloneDeriving, DeriveFunctor #-}
module Calculus.AutoDiff where
import Calculus.Expression
```

Recall our evaluation function from the previous post. Its signature was

`eval :: Floating a => Expr a -> a -> a`

The way we interpreted it was that if we supplied an expression `e` and a value `c` of type `a`, it would collapse the expression, substituting all instances of the variable `X` by `c`, and return the resulting value. But thanks to currying we may also view `eval` as taking an expression `e` and returning a Haskell function `eval e :: a -> a`. Thus our code is capable of transforming expressions into functions.

At this point, one may ask if we can do the opposite. So, can we take an ordinary Haskell function and find the symbolic expression that it represents? The answer, quite surprisingly to me, is: *yes, provided that it is polymorphic*.

If you take a function such as `g :: Double -> Double` that only works for a single type^{1}, all hope is lost. Any information regarding “the shape” of the operation performed by the function will have already disappeared at runtime and perhaps even been optimized away by the compiler (as it should be). Nevertheless, polymorphic functions that work for any `Floating` type, such as `f :: Floating a => a -> a`, are flexible enough to still carry information about their syntax tree even at runtime. One reason for this is that we defined a `Floating` instance for `Expr a`, allowing the function `f` to be specialized to the type `Expr a -> Expr a`. Thus we can convert between polymorphic functions and expressions.

`uneval :: (forall a. Floating a => a -> a) -> (forall b. Floating b => Expr b)`

Notice the explicit `forall`: `uneval` only accepts polymorphic arguments.^{2} After finding the right type signature, the inverse to `eval` is then really simple to write. The arithmetic operations on an `Expr a` just build a syntax tree, thus we can construct an expression from a polymorphic function by substituting its argument by the constructor `X`.

`uneval f = f X`

Let’s test it on ghci to see that it works:

```
ghci> uneval (\x -> x^2 + 1)
X :*: X :+: Const 1.0
it :: Floating b => Expr b
ghci> uneval (\x -> exp (-x) * sin x)
Apply Exp (Const (-1.0) :*: X) :*: Apply Sin X
it :: Floating b => Expr b
```

The `uneval` function allows us to compute a syntax tree for a polymorphic function during a program’s runtime. We can then manipulate this expression and turn the result back into a function through `eval`. Or, if we know how to do some interesting operation with functions, we can do the opposite process and apply it to our expression! This will be our focus in the next section.

In math, derivatives are concisely defined via a limiting process: $$ f'(x) = \lim_{\varepsilon \to 0}\frac{f(x + \varepsilon) - f(x)}{\varepsilon}. $$

But when working with derivatives in a computer program, we can’t necessarily take limits of an arbitrary function. Thus, how to deal with derivatives?

One approach is *numerical differentiation*, where we
approximate the limit by using a really small *ε*:

```
numDiff' eps f x = (f (x + eps) - f x) / eps

numDiff = numDiff' 1e-10
```

This is prone to numerical stability issues and doesn’t compute the real derivative but only an approximation to it.
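For instance, a quick standalone experiment shows the finite-difference estimate is close to, but not exactly, the true derivative of *x*² at 1 (which is 2):

```haskell
-- Finite-difference derivative, as defined above.
numDiff' :: Double -> (Double -> Double) -> Double -> Double
numDiff' eps f x = (f (x + eps) - f x) / eps

main :: IO ()
main = do
  let approx = numDiff' 1e-10 (\x -> x ^ 2) 1
  -- Within 1e-3 of the exact derivative, but polluted by rounding
  -- error from the catastrophic cancellation in the numerator.
  print (approx /= 2 && abs (approx - 2) < 1e-3)
```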

Another approach is what we followed in the previous post: *symbolic differentiation*. This is the same way that one is used to compute derivatives by hand: you take the algebraic operations that you learned in calculus class and implement them as transformations on an expression type representing a syntax tree. One difficulty of this, as you may have noticed, is that symbolic calculations require lots of rewriting to get the derivative in a proper form. They also require that you work directly with expressions and not with functions. This, despite being mitigated by our `eval` and `uneval` operators, can be pretty inefficient when your code is naturally composed of functions. Besides that, if we wanted to change our `Expr` type, for example to use a more efficient operation under the hood, to add a `:^:` constructor for power operations, or to add new transcendental functions, we would have to modify both our `eval` and `diff` functions to account for this.

A third option that solves all the previous issues is *automatic differentiation*. This uses the fact that any `Floating a => a -> a` is in fact a composition of arithmetic operations and some simple transcendental functions such as `exp`, `cos`, `sin`, etc. Since we know how to differentiate those, we can augment our function evaluation to calculate at the same time both the function value and the exact value of the derivative at any given point. As we will see, we will even be able to recover symbolic differentiation as a special case of automatic differentiation.

Here we will do the simplest case of automatic differentiation, namely *forward-mode AD* using dual numbers. This is only for illustrative purposes. If you are planning to use automatic differentiation in a program, I really recommend taking a look at the ad package.

In mathematics, a *dual number* is an expression *a* + *b**ε* with the
additional property that *ε*^{2} = 0. One can think of
it as augmenting the real numbers with an infinitesimal factor. As
another intuition: this definition is very similar to the complex
numbers, with the difference that instead of *i*^{2} = − 1, we have *ε*^{2} = 0.^{3}

The nicety of the dual numbers is that they can automatically calculate the derivative of any analytic function. To see how this works, let’s look at the Taylor series of a function `f` expanded around a point `a`.

$$ f(a + b\varepsilon) = \sum_{n=0}^\infty \frac{1}{n!}f^{(n)}(a) (b\varepsilon)^n = f(a) + bf'(a)\varepsilon. $$

Therefore, applying *f* to a
number with an infinitesimal part amounts to taking its first order
expansion.

Ok, back to Haskell. As usual, we translate this definition into Haskell as a parameterized data type carrying two values.

```
data Dual a = Dual a a
  deriving (Show, Eq)
```

Later, it will also be useful to have functions extracting the real and infinitesimal parts of a dual number.

```
realpart (Dual a _) = a
epsPart  (Dual _ b) = b
```

Alright, just like with expressions we will want to make `Dual a` into a number. The sum and product of two dual numbers are respectively linear and bilinear because, well… Because we wouldn’t be calling them “sum” and “product” if they weren’t. In math it reads as

$$ \begin{aligned} (a + b\varepsilon) + (c + d\varepsilon) &= (a + c) + (b + d)\varepsilon, \\ (a + b\varepsilon) \cdot (c + d\varepsilon) &= ac + (bc + ad)\varepsilon + \cancel{bd\varepsilon^2}. \end{aligned}$$

If you found those to bear a strong resemblance to the sum and product rules for derivatives, it’s because they do! These are our building blocks for differentiation.

```
instance Num a => Num (Dual a) where
  -- Linearity
  (Dual a b) + (Dual c d) = Dual (a + c) (b + d)
  (Dual a b) - (Dual c d) = Dual (a - c) (b - d)
  -- Bilinearity and cancel ε^2
  (Dual a b) * (Dual c d) = Dual (a * c) (b*c + a*d)
  -- Embed integers as only the real part
  fromInteger n = Dual (fromInteger n) 0
  -- These below are not differentiable functions...
  -- But their first order expansion equals this except at zero.
  abs (Dual a b) = Dual (abs a) (b * signum a)
  signum (Dual a _) = Dual (signum a) 0
```

For division, we use the same trick as with complex numbers and multiply by the denominator’s conjugate.

$$ \frac{a + b\varepsilon}{c + d\varepsilon} = \frac{a + b\varepsilon}{c + d\varepsilon} \cdot \frac{c - d\varepsilon}{c - d\varepsilon} = \frac{ac + (bc - ad)\varepsilon}{c^2} = \frac{a}{c} + \frac{bc - ad}{c^2}\varepsilon $$

```
instance (Fractional a) => Fractional (Dual a) where
  (Dual a b) / (Dual c d) = Dual (a / c) ((b*c - a*d) / c^2)
  fromRational r = Dual (fromRational r) 0
```

Finally, to extend the transcendental functions to the dual numbers, we use the first order expansion described above. We begin by writing a helper function that represents this expansion.

```
-- First order expansion of a function f with derivative f'.
fstOrd :: Num a => (a -> a) -> (a -> a) -> Dual a -> Dual a
fstOrd f f' (Dual a b) = Dual (f a) (b * f' a)
```

And the floating instance is essentially our calculus cheatsheet again.

```
instance Floating a => Floating (Dual a) where
  -- Embed as a real part
  pi = Dual pi 0
  -- First order approximation of the function and its derivative
  exp = fstOrd exp exp
  log = fstOrd log recip
  sin = fstOrd sin cos
  cos = fstOrd cos (negate . sin)
  asin = fstOrd asin (\x -> 1 / sqrt (1 - x^2))
  acos = fstOrd acos (\x -> -1 / sqrt (1 - x^2))
  atan = fstOrd atan (\x -> 1 / (1 + x^2))
  sinh = fstOrd sinh cosh
  cosh = fstOrd cosh sinh
  asinh = fstOrd asinh (\x -> 1 / sqrt (x^2 + 1))
  acosh = fstOrd acosh (\x -> 1 / sqrt (x^2 - 1))
  atanh = fstOrd atanh (\x -> 1 / (1 - x^2))
```

Now that we have set up all the dual number tooling, it is time to calculate some derivatives. From the first order expansion *f*(*a*+*b**ε*) = *f*(*a*) + *b**f*′(*a*)*ε*, we see that by applying a function to *a* + *ε*, that is, setting *b* = 1, we calculate *f* and its derivative at *a*. Let’s test this in ghci:

```
ghci> f x = x^2 + 1
f :: Num a => a -> a
ghci> f (Dual 3 1)
Dual 10 6
it :: Num a => Dual a
```

Just as we expected! We can thus write a differentiation function by
doing this procedure and taking only the *ε* component.

`autoDiff f c = epsPart (f (Dual c 1))`
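As a standalone sanity check, repeating the `Dual` pieces above so the snippet compiles on its own, we can differentiate a polynomial over the integers and get an exact answer:

```haskell
-- Dual numbers: a value paired with its infinitesimal part.
data Dual a = Dual a a deriving (Show, Eq)

epsPart :: Dual a -> a
epsPart (Dual _ b) = b

instance Num a => Num (Dual a) where
  Dual a b + Dual c d = Dual (a + c) (b + d)
  Dual a b - Dual c d = Dual (a - c) (b - d)
  Dual a b * Dual c d = Dual (a * c) (b*c + a*d)
  fromInteger n       = Dual (fromInteger n) 0
  abs (Dual a b)      = Dual (abs a) (b * signum a)
  signum (Dual a _)   = Dual (signum a) 0

-- Evaluate f at Dual c 1 and keep the ε part: that is f'(c).
autoDiff :: Num a => (Dual a -> Dual a) -> a -> a
autoDiff f c = epsPart (f (Dual c 1))

main :: IO ()
main = print (autoDiff (\x -> x ^ 3 - 2 * x) (2 :: Integer))
-- d/dx (x^3 - 2x) = 3x^2 - 2, which at x = 2 gives 10.
```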

Some cautionary words: remember from the previous discussion that to have access to the structure of a function, we need it to be polymorphic. In particular, our `autoDiff` has type `Num a => (Dual a -> Dual b) -> (a -> b)`. It gets a function on dual numbers and spits out a function on numbers. But for our use case it is fine, because we can specialize this signature to

`autoDiff :: (forall a . Floating a => a -> a) -> (forall a . Floating a => a -> a)`

Recall we can use `eval` to turn an expression into a function and, reciprocally, we can apply a polymorphic function to the constructor `X` to turn it into an expression. This property, which for the mathematicians among you probably resembles a similarity transformation, allows us to “lift” `autoDiff` into the world of expressions. So, what happens if we take `eval f` and compute its derivative at the point `X`? We get the **symbolic derivative** of `f`, of course!

`diff_ f = autoDiff (eval f) X`

Some tests in the REPL to see that it works:

```
ghci> diff_ (sin (X^2))
(Const 1.0 :*: X :+: X :*: Const 1.0) :*: Apply Cos (X :*: X)
it :: Floating a => Expr a
```

This function nevertheless has a flaw: it depends too much on polymorphism. While our symbolic differentiator from the previous post worked for an expression `f :: Expr Double`, for example, this new function depends on being able to convert `f` to a polymorphic function, which it can’t do in this case. This becomes clear by looking at the type signature of `diff_`:

`diff_ :: Floating a => Expr (Dual (Expr a)) -> Expr a`

But not all hope is lost! Our differentiator works. All we need is to discover how to turn an `Expr a` into an `Expr (Dual (Expr a))` and we can get the proper type.

Let’s think… Is there a canonical way of embedding a value as an expression? Of course there is! The `Const` constructor does exactly that. Similarly, we can view a “normal” number as a dual number with zero infinitesimal part. Thus, if we change each coefficient in an expression by the rule `\c -> Dual (Const c) 0`, we get an expression of the type we need without changing any meaning.

To help us change the coefficients, let’s give a `Functor` instance to `Expr`. We could write it by hand but let’s use some GHC magic to automatically derive it for us.

`deriving instance Functor Expr`

Finally, our differentiation function is equal to `diff_`, except that it first converts all coefficients of the input to the proper type.

```
-- Symbolically differentiate expressions
diff :: Floating a => Expr a -> Expr a
diff f = autoDiff (eval (fmap from f)) X
  where from x = Dual (Const x) 0
```

Just apply it to a monomorphic expression and voilà!

```
ghci> diff (sin (X^2) :: Expr Double)
(Const 1.0 :*: X :+: X :*: Const 1.0) :*: Apply Cos (X :*: X)
it :: Expr Double
```

- ad package on Hackage
- The Simple Essence of Automatic Differentiation by Conal Elliott.

Also known by the fancy name of *monomorphic function*. These are functions without any free type parameter. That is, no lowercase type variable appears in the type signature.↩︎

This is not the most general possible definition of `uneval`. But it is simple and clear enough for this presentation.↩︎

If you’re into Algebra, you can view the complex numbers as the polynomial quotient ℝ[*X*]/⟨*X*^{2} + 1⟩ while the dual numbers are ℝ[*X*]/⟨*X*^{2}⟩.↩︎

I began by teaching my friend some Lua, and then we started building our game with love2d. At first, the setup was basically me sharing my screen in Google Meet or Jitsi and him following along and asking questions. Since the lectures are pretty one-sided, this worked well but became too cumbersome when we started to make the game itself. All the “Now edit line 32” or “Go to the definition of f and change that” just weren’t cutting it. Furthermore, sometimes there are connection issues and the Meet screen sharing lags a lot.

We had some ups and downs finding a good setup for programming together, so I started a quest in search of a better one. This post is a step-by-step guide where I try to document everything, to make life easier for anyone trying to reproduce it (my future self included).

The video below is an example of this workflow using two terminals on the same machine (but still connecting via ssh).

Since both my friend and I are already Linux and (neo)vim users, it seemed reasonable to look for a terminal based setup.

The idea is the following: I begin by starting a shared tmux session. Then he connects to my machine via ssh (to his own user) and attaches to the same tmux session, this means we both can control and interact with the same shell. After that, we just have to navigate to the project’s folder and play with it as we wish.

Of course most of the steps above will be automatic. And did I already mention that it is only possible to stay connected via ssh while the tmux session exists?

I tested this with my machine running Arch Linux as host and my friend’s machine running Ubuntu 20.04 or Ubuntu WSL. Therefore, the commands below will always assume `pacman` is the system’s package manager and that `systemd` is installed and in charge of the services. But it should be easy to adapt them to any other Linux host^{2}. For the guest, any system with an ssh client should do.

Note: This guide also assumes you have root access on your machine.

We begin by ensuring everything we need is installed. On the host system we need an ssh server as well as tmux for sharing the session.

`sudo pacman -S openssh tmux`

Since I travel a lot and can’t always guarantee that I will have admin access to my current router, I’m also using `ngrok` for reverse tunneling. I must admit that I don’t like letting a third party handle this part and would be much more comfortable using my own VPN or some other solution. But since I don’t want to spend any money on this and ngrok’s free plan meets my needs pretty well, the setup will stay like this until I can think of a clever (and cheap) way to do this part.

Therefore, just follow the instructions on ngrok’s site, sign up for an account, download it and ngrok will be ready to go. Also notice that if you are pair programming in a LAN, there is no need for ngrok as you may simply ssh using the private IPs. In our case, we are some 1000 kilometers away so some kind of port forwarding or reverse tunneling is more than needed.

Also ensure that you have a terminal based text editor: neovim, vim, emacs, nano, ed etc.

I must admit that I don’t really like giving ssh permissions to my personal user. Even though I trust my friends^{3}, it is usually better to let them only play with a sandboxed part of my machine. On the other hand, when we are pair programming, I want to let them have access to my files. The steps below are a nice way to reconcile those two opposing goals.

We begin by creating a user and giving it a password. I will call it `frienduser` but you may name it whatever you like.

```
sudo useradd -m frienduser
sudo passwd frienduser
```

Now we create a group for both our users. I will call it `tmux`, perhaps unimaginatively, but, again, you can call it whatever you want. Be sure to add both your own user and the newly created user to the group.

```
sudo groupadd tmux
sudo usermod -a -G tmux youruser
sudo usermod -a -G tmux frienduser
```

This is the only step on the setup that you will need to call your friend. Send him a message asking for his public key.

Of course, if he is not as tech savvy as you, it is possible that he has no idea what this means. Then you must tell him to make sure that he has ssh installed and send you the content of the file `~/.ssh/id_rsa.pub`. If it does not exist, he must run the command

`ssh-keygen`

And follow the instructions there. Setting a password for his private key when queried is also a good idea.

Now the file `~/.ssh/id_rsa.pub` should exist on his machine and he can send you its content.

Note: I feel I shouldn’t have to mention this, but for the sake of completeness I will… The command `ssh-keygen` creates two files, a private key (`id_rsa`) and a public key (`id_rsa.pub`). One should **never** send the private key.

Now that you have the key at hand, you must let ssh know of it. The ssh server looks for authorized keys for a given user in the file `$HOME/.ssh/authorized_keys`. Thus, we must create this file in frienduser’s home folder and add the key to it.

```
sudo -u frienduser mkdir -p /home/frienduser/.ssh
sudo -u frienduser echo -n 'command="tmux -u -S /tmp/sharedtmux attach -t gamemaking" ' >> /home/frienduser/.ssh/authorized_keys
sudo -u frienduser echo $PUBLIC_KEY >> /home/frienduser/.ssh/authorized_keys
```

The middle line may seem weird at first. It means that your friend’s connection will automatically put him in the right shared tmux session. This is discussed more when we create the shared tmux session. As a positive side effect, the connection will only be accepted if the session exists.
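For reference, after these two commands `frienduser`’s `authorized_keys` should contain a single line combining the forced command with the key. With a hypothetical key (truncated here for illustration), it would look something like:

```
command="tmux -u -S /tmp/sharedtmux attach -t gamemaking" ssh-rsa AAAAB3NzaC1yc2E... friend@laptop
```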

This section is totally optional and is here just for added security. On some distros, the OpenSSH server defaults to only accepting connections using key pairs, while others will accept logins by password. Here we will see how to configure ssh to require both a key *and* a password.

OpenSSH stores its server configuration in the file `/etc/ssh/sshd_config`^{4}. Just open it with your favorite editor and add the following line at the end:

`AuthenticationMethods "publickey,password"`

Now the server will ask for both a key and a password before allowing login. Keep in mind that there are other options for authentication methods besides these two, but we’re not covering them today. For a full list, take a look at the man page for `sshd_config(5)`.

Also know that these changes only apply after you restart the service.

When multiple clients connect to the same tmux session, it will by default resize the screen on each command to match the last client to do something. I personally find all this resizing pretty annoying and prefer to configure tmux to stick with a size that fits all connected clients. All you have to do is edit `~/.tmux.conf` and add the following lines:

```
set -g window-size smallest
setw -g aggressive-resize on
```

Now tmux will always use the smallest size among all clients, so it is a good idea to tell everybody working on this setup to use their terminals on fullscreen.

You should start by guaranteeing that the ssh server daemon is running. In Arch (or any other systemd based distro),

`systemctl status sshd`

If the service is inactive or dead, start it with

`systemctl start sshd`

You will be queried for your password and then it is ready to go.

I also tested this in WSL2^{5} but there Ubuntu does not use systemd (maybe due to some system incompatibilities?). If you are following this tutorial from WSL2, the respective commands are

```
sudo service ssh status
sudo service ssh start
```

We will store the shared session in a temporary file that can be accessed both by your and your friend’s user. First create the session with a special socket file:

`tmux -S /tmp/sharedtmux new-session -s gamemaking -d`

Here I am calling the session `gamemaking` and storing the socket in `/tmp/sharedtmux` but, again, you can give them whatever names you want. I also decided to use a temporary file because I prefer these sockets to be disposable. However, you may prefer to store it somewhere persistent to keep your session’s layout and open programs. The only important thing here is that the folder should be visible to all users in the `tmux` group.

Now we allow both users to read and write this file,

```
chgrp tmux /tmp/sharedtmux
chmod g+rw /tmp/sharedtmux
```

This way, both of them will have total control of the tmux process.

If you followed the steps on ngrok’s manual, it should suffice to run a tcp connection on port 22 (This is ssh’s default port).

`ngrok tcp 22`

When the TUI starts, there will be a line looking something like

`Forwarding tcp://2.tcp.ngrok.io:15727 -> localhost:22`

The `2.tcp.ngrok.io` part is your assigned hostname (it is always an arbitrary number followed by `.tcp.ngrok.io`) and the number after the colon, `15727` in this example, is the port. You should copy both the hostname and port somewhere and send them to your friend. These redirect directly into your machine.

An important note: you must leave ngrok’s window open. If you exit it, the process will be killed and the tunnel closed.

Now, in a new terminal you can attach to the tmux session and wait for your friend to connect.

`tmux -u -S /tmp/sharedtmux attach -t gamemaking`

Provided your friend’s key is already setup, all you have to do is copy the ngrok port and hostname and send it. Then, connecting should be as simple as running something like

`ssh -t -p PORT frienduser@HOST`

Now you’re ready to go! Have fun programming together!

Although the setup above works, there is something in it that really bothers me: you have to manually look at ngrok’s TUI and copy the hostname and port from it. C’mon, we’re in a Linux shell, the land of automation! Of course there is some better way to do that.

I read and reread ngrok’s help pages, and it doesn’t seem possible to directly ask the binary for this information in any way. Luckily, I stumbled upon this gist and this blog post while trying to circumvent it and, good news: there is a way!

Apparently ngrok exposes its tunneling information on `localhost:4040` in JSON format. Thus, we can query it with a bit of `sed` magic:

```
curl -s http://localhost:4040/api/tunnels \
| sed -nE 's|.*"public_url":"tcp://([^:]*):([0-9]*)".*|\1 \2|p'
```

Now that we solved the last bit of manual work, we can put everything together in a script that creates a shared tmux session, tunnels it with ngrok and tells you how to remotely connect to it. Below is a script based on the one found at the end of this guide. Also notice that if the `sshd` service is inactive, you will need the root password to start it.

```
#!/usr/bin/env bash
# Read parameters from command line arguments
# or use the same defaults as this post
frienduser="${1:-frienduser}"
tmux_file="${2:-/tmp/sharedtmux}"
tmux_session="${3:-gamemaking}"
tmux_group="${4:-tmux}"
# If sshd is inactive, begin by starting it
systemctl status sshd.service >/dev/null
if [[ $? -gt 0 ]]; then
sudo systemctl start sshd.service
fi
# Create shared tmux session
tmux -S "$tmux_file" new-session -s "$tmux_session" -d
# Assign right permissions to group
chgrp $tmux_group "$tmux_file"
chmod g+rw "$tmux_file"
# Start ngrok using the same tmux socket file
# but in a different session.
# This ensures the ngrok TUI will run non-blocking.
tmux -S "$tmux_file" new-session -s ngrok -d
tmux -S "$tmux_file" send -t ngrok "ngrok tcp 22" ENTER
# Wait a little while ngrok starts
sleep 2
# Query ngrok for host and port
ngrok_url='http://localhost:4040/api/tunnels'
query_ngrok() {
local ngrok_host ngrok_port
read ngrok_host ngrok_port <<< $(curl -s $ngrok_url \
| sed -nE 's|.*"public_url":"tcp://([^:]*):([0-9]*)".*|\1 \2|p')
echo "ssh -t -p $ngrok_port $frienduser@$ngrok_host"
}
# Echo the proper ssh command to tmux
tmux -S "$tmux_file" send -t "$tmux_session" "echo $(query_ngrok)" ENTER
# And also copy it to X clipboard for convenience
if command -v xclip &> /dev/null; then
query_ngrok | xclip
fi
# Attach to the shared session
tmux -u -S "$tmux_file" attach -t "$tmux_session"
```

- Using SSH and Tmux for screen sharing
- How to pair-program remotely and not lose the will to live
- ssh, tmux and vim: A Simple Yet Effective Pair Programming Setup
- rjz/ngrok_hostname.sh

More on this on another post (someday, I hope).↩︎

Probably, all you have to do is change `pacman` for your system’s package manager (`apt`, `yum`, `xbps` etc.), maybe change the location of a couple configuration folders and substitute `systemctl` by whatever service manager you may use.↩︎

I don’t trust someone who may steal their machine and ssh into mine though.↩︎

Beware not to mix this up with `/etc/ssh/ssh_config`, the file where OpenSSH stores the *client* configuration.↩︎

I know what you’re thinking… But I really need Windows for ~~gaming~~ work.↩︎

Well, some days ago João Paixão sent me a link to a 1998 paper by
Pavlović and Escardó called “Calculus in Coinductive Form”. In it the
authors show that if we look at Taylor series as streams of real
numbers, then solving these differential equations becomes *as easy as
writing them*!

Of course, I got really excited with the idea and had to turn it into code. After all, that is the epitome of declarative programming! The paper is really readable, and I urge you to take a look at it. There are many mathematical details that I will not touch and even a whole discussion on how these same techniques apply to Laplace transforms. As an appetizer of what we are going to do, consider the initial value problem

$$ y'' = -x^2 y + 2 y' + 4 \\ y(0) = 0,\; y'(0) = 1. $$

By the end of this post we’re going to be able to solve this differential equation simply by writing the equivalent Haskell definition:

`y = 0 :> 1 :> (-x^2) * y + 2 * diff y + 4`

```
{-# LANGUAGE DeriveFunctor, DeriveFoldable, NoMonomorphismRestriction #-}
import Data.Foldable (toList)
```

First, a disclaimer: I will not deal with convergence issues in this post. For us everything will be beautiful and perfect and analytic. Of course, since ignoring these things makes a chill run down my spine, let’s just agree that every time I say “smooth” I actually mean “analytic in a big enough neighbourhood around whatever points we are considering”. Nice, now it’s calculus time.

Our basic tool is the all too famous Fundamental Theorem of Calculus.
Consider a smooth function *f*;
an application of the FTC tells us that any smooth *f* is uniquely determined by its
value at a point *a* and its
derivative:

$$ f(a + x) = f(a) + \int_a^{a+x} f'(t)\,dt. $$

Ignoring all meaning behind the integral, derivative, evaluation etc. we can view this as a recipe: smooth functions are equivalent to pairs containing a number and another smooth function. I don’t know about you but this sounds a lot like a recursive datatype to me!

```
data Stream a = a :> Stream a
deriving (Functor, Foldable)
infixr 2 :>
```

The `Stream a`

datatype represents a list with an infinite
amount of elements. To see this all you have to do is unfold the
definition. In our context this means that a smooth function may be
represented by the infinite list of its derivatives at a point:

```
f = f a :> f'
= f a :> f' a :> f''
= f a :> f' a :> f'' a :> f'''
= ...
```

Since the constructor `(:>)` is our way to represent
the FTC, the above amounts to saying that we can represent a
(sufficiently regular) function by its Taylor series. That is, by
applying the FTC recursively to the derivatives of *f*, we get

$$ f(a + x) = \sum_{k=0}^\infty f^{(k)}(a) \frac{x^k}{k!}. $$

As expected, the only information that actually depends on *f* is its derivatives at *a*. In math, we represent this as a
power series but, by linear independence, this is completely equivalent
to the stream of its coefficients in the basis {*x*^{k}/*k*!}.
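To make the formula concrete, here is a quick standalone numerical check, independent of the stream machinery we are about to build (the name `taylorExp` is made up for this sketch): for *f* = exp at *a* = 0 every derivative equals 1, so the partial sums of *x*^{k}/*k*! should approach e^{x}.

```haskell
-- Partial sums of the Taylor series of exp around 0.
-- (Standalone sanity check; not part of the post's stream code.)
taylorExp :: Int -> Double -> Double
taylorExp n x = sum [ x ^ k / fromIntegral (product [1 .. k]) | k <- [0 .. n] ]
```

With 20 terms, `taylorExp 20 1` already agrees with `exp 1` essentially to machine precision.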

With this we’re done. Well… We’re not actually done, there is still a lot of cool stuff I want to show you. Nevertheless, the definition above is already enough to replicate the power series method of solving ODEs. Don’t believe me? Let’s solve some familiar equations then.

The exponential function is the unique solution to *y*′ = *y*, *y*(0) = 1, which becomes the
following recursion in Haskell:

`ex = 1 :> ex`

Let’s check the starting coefficients of `ex` on ghci to
confirm that they are exactly the derivatives of exp at zero:

```
ghci> take 10 (toList ex)
[1,1,1,1,1,1,1,1,1,1]
```

Defining sine and cosine as the solutions to a system of ODEs becomes, in Haskell, a mutually recursive definition:

```
sine   = 0 :> cosine
cosine = 1 :> fmap negate sine
```

As expected, these streams follow the same alternating pattern of 0, 1, 0, −1, … as the Taylor coefficients.

```
ghci> take 10 (toList sine)
[0,1,0,-1,0,1,0,-1,0,1]
ghci> take 10 (toList cosine)
[1,0,-1,0,1,0,-1,0,1,0]
```

Even though we know how to calculate the Taylor coefficients they’re
only means to an end. The main reason one wants to solve differential
equations is to calculate *functions*, not series. Let’s then
hack a poor man’s function approximation for these Taylor expansions.
For simplicity, I will use a fixed amount of 100 coefficients. This
works for this demonstration but in any real program, it is better to
call upon some analysis to find out the right amount of terms for your
desired error estimate. Let’s then create a higher order function that
converts streams of real numbers into real valued functions.

```
-- | Turn a Stream f into a functional approximation
-- of its Taylor series around a point a.
-- That is, eval a f x ≈ f(x) for x near a.
eval :: Fractional a => a -> Stream a -> a -> a
eval a f x = foldr1 (\fa f' -> fa + (x - a) * f') (take 100 taylor)
  where
    taylor     = zipWith (/) (toList f) factorials
    factorials = let fats = 1 : zipWith (*) fats [1..]
                 in fmap fromIntegral fats
```
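For the record, here is one possible refinement of the fixed 100-term cutoff, sketched with an assumed name `evalTol` (not from the post): keep summing Taylor terms until one falls below a given tolerance. It works on a plain list of coefficients *c*_{k} = *f*^{(k)}(*a*)/*k*! and assumes the terms eventually shrink, i.e., that we are inside the radius of convergence.

```haskell
-- Evaluate sum of c_k * (x - a)^k, stopping at the first term smaller
-- than `tol` in absolute value. Assumes the terms eventually shrink;
-- on a divergent series this would loop forever.
evalTol :: Double -> Double -> [Double] -> Double -> Double
evalTol tol a cs x = go 0 1 cs
  where
    go acc _  []         = acc
    go acc dx (c : rest)
      | abs t < tol = acc + t
      | otherwise   = go (acc + t) (dx * (x - a)) rest
      where t = c * dx
```

With the exponential’s coefficients 1/*k*!, `evalTol 1e-12 0 coef 1` reproduces `exp 1` to about the requested tolerance.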

With our evaluator in hand, it’s time to test our previous streams against some well-known values:

```
ghci> eval 0 ex 0
1.0
ghci> eval 0 ex 1
2.718281828459045
ghci> fmap (eval 0 sine) [0, pi/2, pi, 2*pi]
[0.0,1.0,0.0,0.0]
ghci> fmap (eval 0 cosine) [0, pi/2, pi, 2*pi]
[1.0,0.0,-1.0,1.0]
```

Quite nice, huh? Just a few lines of code and we already have the power to solve and approximate some classical differential equations! All thanks to Haskell’s laziness and the FTC. Our solver is done, but the code still lacks a cleaner interface to manipulate streams and represent differential equations. Let’s define some functions to mitigate that.

From the previous discussion, we can get the derivative of a stream simply by dropping the first term.

```
-- | Taylor series representation of the derivative.
diff :: Stream a -> Stream a
diff (_ :> f') = f'
```

It is possible to embed any constant as a stream with derivative
zero. Also, let’s define a stream `x` representing the
identity function^{2} in order to make our equations look
a bit nicer.

```
-- | Taylor series for the constant zero.
zero :: Num a => Stream a
zero = 0 :> zero

-- | Taylor series for the identity function `f x = x`.
x :: Num a => Stream a
x = 0 :> 1 :> zero
```

Finally, our fellow mathematicians and physicists who may perhaps
use this code will certainly want to do arithmetical manipulations on
the series. We can achieve that with the traditional `Num`,
`Fractional` and `Floating` instances. As usual
with these Calculus posts, these instances correspond to the well-known
formulas for derivatives. Let’s start with the arithmetic classes.

```
instance Num a => Num (Stream a) where
  -- Good ol' linearity
  (+) (fa :> f') (ga :> g') = fa + ga :> f' + g'
  (-) (fa :> f') (ga :> g') = fa - ga :> f' - g'
  negate = fmap negate
  -- Leibniz rule applied to streams
  (*) f@(fa :> f') g@(ga :> g') = fa * ga :> f' * g + f * g'
  fromInteger n = fromInteger n :> zero
  abs    = error "Absolute value is not a smooth function"
  signum = error "No well-defined sign for a series"

instance Fractional a => Fractional (Stream a) where
  -- The division rule from Calculus. We assume g(0) ≠ 0
  (/) f@(fa :> f') g@(ga :> g') = fa / ga :> (f' * g - f * g') / g^2
  fromRational n = fromRational n :> zero
```
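As a small sanity check of the Leibniz-rule product, here is a self-contained snippet that just restates the definitions above: since the stream `x` holds the derivatives of the identity at 0, `x * x` should hold the derivatives of *x*², namely 0, 0, 2, 0, 0, …

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable #-}
import Data.Foldable (toList)

data Stream a = a :> Stream a deriving (Functor, Foldable)
infixr 2 :>

zero :: Num a => Stream a
zero = 0 :> zero

x :: Num a => Stream a
x = 0 :> 1 :> zero

instance Num a => Num (Stream a) where
  (fa :> f') + (ga :> g') = fa + ga :> f' + g'
  (fa :> f') - (ga :> g') = fa - ga :> f' - g'
  negate = fmap negate
  f@(fa :> f') * g@(ga :> g') = fa * ga :> f' * g + f * g'
  fromInteger n = fromInteger n :> zero
  abs    = error "Absolute value is not a smooth function"
  signum = error "No well-defined sign for a series"
```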

For the `Floating` instance, we will use the chain rule
and the fact that we know the derivatives for all methods in the class.
I recommend taking a look at the implementation we did in a previous post for Dual numbers.
They are strikingly similar, which is no coincidence of course. The main
idea is that applying an analytic *g* to the stream of *f* is the same as calculating the
derivatives for *g* ∘ *f*.
Thus, all our `Floating` methods will look like this:

`g f = g (f a) :> g' f * f'`

This is Haskell, so we can turn this idea into a higher order
function taking both `g` and its derivative:

`analytic g g' f@(fa :> f') = g fa :> g' f * f'`

```
instance Floating a => Floating (Stream a) where
  pi    = pi :> zero
  exp   = analytic exp exp
  log   = analytic log recip
  sin   = analytic sin cos
  cos   = analytic cos (negate . sin)
  asin  = analytic asin (\x -> 1 / sqrt (1 - x^2))
  acos  = analytic acos (\x -> -1 / sqrt (1 - x^2))
  atan  = analytic atan (\x -> 1 / (1 + x^2))
  sinh  = analytic sinh cosh
  cosh  = analytic cosh sinh
  asinh = analytic asinh (\x -> 1 / sqrt (x^2 + 1))
  acosh = analytic acosh (\x -> 1 / sqrt (x^2 - 1))
  atanh = analytic atanh (\x -> 1 / (1 - x^2))
```

With all those instances, we can give power series the same
first-class numeric treatment that they receive in mathematics. For
example, do you want to approximate some complicated integral? Just use
the stream `x` that we previously defined:

```
ghci> erf = 0 :> exp (-x^2)
ghci> take 10 (toList erf)
[0.0,1.0,-0.0,-2.0,0.0,12.0,0.0,-120.0,0.0,1680.0]
```

Also, we’ve only dealt with linear equations until now but as long as everything is analytic, these methods readily extend to non-linear equations.

```
ghci> y = 0 :> 1 :> x^2 * cos (diff y) - x * sin y
ghci> take 10 (toList y)
[0.0,1.0,0.0,0.0,-0.9193953882637205,0.0,4.0,20.069867797120825,-6.0,-265.9036412154172]
```

Finally, we should discuss a couple of caveats of this method. Solving
an ODE through a Taylor series can be slow… That is why, in practice,
this would only be used for the most well-behaved equations. There is
also the issue of convergence that we decided to ignore during this
post. Not all `Floating a => a -> a` functions are
analytic everywhere, and when this hypothesis doesn’t hold, the method
will just silently fail and return garbage such as `Infinity`
or `NaN` for the coefficients. Nevertheless, this “automatic”
solution is pretty much equivalent to what one would do to solve this
kind of equation by hand, including these same issues. In fact, I would
even risk saying that the Haskell internals are much more optimized than
anything one could hope to do by hand.

To sum everything up, I want to note a cool fact that I’ve only realized after writing the entire post: there is a direct relationship between this method of solving ODEs and forward-mode automatic differentiation!

When working with dual
numbers, we define their numeric instances to obey *ε*^{2} = 0, implying that
any analytic function satisfies

$$ f(a + \varepsilon) = f(a) + f'(a)\,\varepsilon. $$

This is equivalent to Taylor expanding *f* around *a* and truncating the series after
the first order terms. However, nothing really forces us to stop there!
Since the derivative of an analytic function is also analytic, we can
again Taylor expand it to get the second derivative and so on. By
recursively repeating this procedure we get the entire Taylor expansion.
So, if instead of using a term *ε* that vanishes at second order, we
apply *f* to *a* + *x*, we get all
derivatives of *f* at *a*.
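To make the comparison concrete, here is a minimal first-order dual-number sketch (the names `Dual` and `derivAt` are made up for illustration; the earlier post on dual numbers does this properly). Truncating at *ε*² = 0 yields exactly *f*(*a*) and *f*′(*a*), and nothing beyond:

```haskell
-- Dual numbers: a value together with a first derivative,
-- with arithmetic obeying ε² = 0.
data Dual = Dual Double Double deriving (Eq, Show)

instance Num Dual where
  Dual a b + Dual c d = Dual (a + c) (b + d)
  Dual a b - Dual c d = Dual (a - c) (b - d)
  Dual a b * Dual c d = Dual (a * c) (a * d + b * c)
  negate (Dual a b)   = Dual (negate a) (negate b)
  fromInteger n       = Dual (fromInteger n) 0
  abs    = error "not differentiable"
  signum = error "not differentiable"

-- f(a + ε) = f(a) + f'(a) ε: evaluate f at (Dual a 1) and read
-- off both the value and the first derivative at once.
derivAt :: (Dual -> Dual) -> Double -> (Double, Double)
derivAt f a = let Dual v d = f (Dual a 1) in (v, d)
```

For instance, `derivAt (\t -> t * t + 3 * t) 2` gives the pair `(10.0, 7.0)`.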

This is the same thing we have been doing all along with Streams; the only
difference being that we write `a :> 1` to represent *a* + *x*. So, similarly to
dual numbers, we can define a procedure to calculate *all
derivatives* of a polymorphic *f* at the point *a* by applying *f* to a suitable Stream.

```
-- | A Stream with all derivatives of f at a.
diffs f a = f (a :> 1)
```
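To see `diffs` at work end to end, here is a self-contained condensation of the stream definitions from this post plus `diffs`, used to read off the derivatives of *t* ↦ *t*² at 2: value 4, first derivative 4, second derivative 2, then zeros.

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable #-}
import Data.Foldable (toList)

data Stream a = a :> Stream a deriving (Functor, Foldable)
infixr 2 :>

zero :: Num a => Stream a
zero = 0 :> zero

instance Num a => Num (Stream a) where
  (fa :> f') + (ga :> g') = fa + ga :> f' + g'
  (fa :> f') - (ga :> g') = fa - ga :> f' - g'
  negate = fmap negate
  f@(fa :> f') * g@(ga :> g') = fa * ga :> f' * g + f * g'
  fromInteger n = fromInteger n :> zero
  abs    = error "Absolute value is not a smooth function"
  signum = error "No well-defined sign for a series"

-- | A Stream with all derivatives of f at a, obtained by
-- applying f to the stream representing a + x.
diffs :: Num a => (Stream a -> Stream a) -> a -> Stream a
diffs f a = f (a :> 1)
```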

This post gained life thanks to the enthusiasm of João Paixão and Lucas Rufino. João sent me the paper and we three had some fun chats about its significance, including becoming perplexed together about how little code we actually needed to implement this.

The Numeric.AD.Mode.Tower module of the ad package.