# Approximation by a Thousand Cuts

An interactive guide to polyhedral representations22 September 2023

As the late hours of the night embrace you, you find yourself seated at your workbench, the substantial dose of caffeine you recently consumed still coursing through your veins with unwavering energy. Then, in a sudden sparkle of inspiration, you stumble upon it–—the elusive function you so much needed. It’s a remarkable little function f : \mathbb{R}^n \to \mathbb{R}, that fits perfectly with all the simulations you intend to run.

You already envision it: the number crunching, the evaluations, the convergence. There is a problem, however. This function, you realize, is as real-valued as it gets, constructed with infinite precision, while all your code and simulations reside in the digital world of a computer’s memory, capable of storing only a finite number of bits. What can one do in such a situation?

Fortunately, mathematicians all over the world have already studied a myriad of methods for approximating functions in a computer. The answer depends, most of all, on what your objectives are, because each kind of approximation only preserves some characteristics, and is, thus, only appropriate for certain applications. If being accurate in some known points is your primary concern, polynomial interpolation may be your answer. On the other hand, if you worry about extrapolating into new datasets, a neural network might prove to be a better fit. Or if your aim is not evaluation but measurement through filters, projection into the Fourier domain or another functional basis could be the path to pursue.

In this post, we will explore *approximations by
cuts*, a technique for representing functions that
is suitable for optimization problems. This is a
well-known tool within the Operations Research community
and, I believe, is also full of potential in various
other domains of engineering and mathematics.

Before we delve deeper into the world of cuts, you can stop and play with the interactive diagram below. Each click on a point in the first panel improves the approximation by cuts on the other one.

## Approximation by cuts

Cutting planes have widespread usage on the Operations Research community because they play really well with optimization problems. Nevertheless, I unfortunately don’t see them being used quite as much in other branches of engineering or mathematics. Thus, before talking about why they are so cool, let’s first present tonight’s star.

A **cut** for f : \mathbb{R}^n \to
\mathbb{R} is an affine function everywhere below
x. That is,

f(x) \ge \left\langle a,\, x\right\rangle + b.

In particular, we say that a cut is *tight* at
x_0 if it equals the
function at this point. In this case, it is common to
denote the cut as centered around x_0:

f(x) \ge f(x_0) + \left\langle a,\, x - x_0\right\rangle.

In general, a single cut is a terrible approximation.
But it is not its fault, after all, it is just a first
order polynomial! The magic begins when you combine many
cuts to represent f as
a *polyhedral function*. Taking an approximation
of f by cuts C = \{\left\langle
a,\,\cdot\right\rangle + b\} amounts to choosing
the largest cut at each given point:

f(x) \approx \max\limits_{(a,b) \in C} \left\langle a_,\,x\right\rangle + b.

This formula also explains why we refer to affine functions as “cuts”. Think about the graph of f. The graph of a cut is a hyperplane dividing the space in two halves: points above and below it. To form the polyhedral function, we slice the space, one cut at a time, in order to carve a polyhedron containing the graph of f.

Click anywhere in the figure below to carve the space until the function’s shape is recognizable. You start with no cut and, as you click, the polyhedral approximation improves until the function’s graph become recognizable. A fun game is to try to do this with the least amount of cuts possible.

### Why I like cuts and you should too

What are the nice properties that an approximation by cuts has? Why is it used so much by the optimization and operations research folks? Let’s go through some of the reasons.

**Underapproximation**: Since all cuts are below f, their maximum also has to be. Thus, cutting planes are a good tool for estimating lower bounds, which is a necessity in optimization.**Easy to improve**: Approximations by cuts can be refined without much hassle. Whenever you obtain a new cut \phi, all you have to do is update your bag of cuts from C to C \cup \{\phi\}, and it’s done. This allows for algorithms based on iterative refinement of the approximation.**Convexity**: Every polyhedral function is convex, and convex functions are full of nice properties needed for optimization. Furthermore, one can prove that as you add more cuts any such approximation converges to the tightest convex function everywhere below f.I also find it worth commenting that this property works as a double-edged sword: since polyhedral functions are always convex, you cannot guarantee a satisfactory approximation for a non-convex function. Keep this in mind when using cuts.

**Linear Programming formulation**: Minimizing a polyhedral function is among the easiest kinds of problems to optimize, because it can be represented as a Linear Program, So if f is some complicated convex function, but you already have a good approximation by cuts for it, you can minimize it instead and get an approximation for the optimum.\begin{aligned} \min_x f(x) &\ge \begin{array}{rl} \min\limits_{x} &\max\limits_{(a,b) \in C} \left\langle a_,\,x\right\rangle + b \\ \end{array} \\ &= \begin{array}{rl} \min\limits_{x, t} & t \\ \textrm{s.t.} & \left\langle a_,\,x\right\rangle + b \le t,\, \forall (a,b) \in C \\ \end{array} \end{aligned}

These properties make cuts an excellent tool for estimating lower bounds on optimization problems, which is a necessity for any optimizer that “sandwiches” the optimal value between two bounds. Let’s now put all these properties to good use.

#### Example: Kelley’s cutting planes algorithm

Imagine that you must minimize a convex function, but, unfortunately, you only have access to a linear programming solver such a Gurobi, Xpress or GLPK. How do you do that?

Fortunately, one can minimize an arbitrary convex function by iteratively solving linear programs. For it to work, we will have to suppose (by now) that we have access to some oracle capable of both evaluating a function and calculating a tight cut for it. This is not cheating though, because in the section Cuts from Derivatives we will learn how to write such an oracle.

Besides the function f, the algorithm needs three more inputs: a tolerance \epsilon, an initial point (can be any point) to evaluate, and a uniform bound \forall x,\, M \le f(x). Since we are minimizing, this bound must exist for the problem to be well-posed. We begin with a polyhedral approximation \tilde{f}_0 which equals M everywhere. The idea is to, at each iteration, give the oracle a point x_n. This will return an evaluation f(x_n), which serves as an upper bound for \min f because all evaluations are larger than the minimum. You also get a tight cut c_n that can be used to improve your approximation to \tilde{f}_n. This way, we squeeze the optimal value between a sequence of bounds:

\min_n f(x_n) \ge \min_x f(x) \ge \min_x \tilde{f}_n(x).

The algorithm can also be more directly described in Julia code:

```
function minimize(f::Function, initial_bound, x = 0; tolerance)
= Polyhedral(initial_bound)
opt = +Inf
ub = initial_bound
lb
while ub - lb > tolerance
# Ask oracle for evaluation and cut at x
= oracle(f, x)
fx, cut
# Improve upper bound
= min(ub, fx)
ub
# Improve lower bound
add_cut!(opt, cut)
= solve_lp(opt) # Also updates the best solution
x, lb end
return x, fx # Optimal point and optimal value
end
```

In practice there are better ways to do unconstrained optimization, but this idea of sandwiching with evaluations and cuts works as a building block to more complex optimization algorithms such as Benders’ decomposition or Stochastic Dual Dynamic Programming.

## A bit of convex geometry

So, where does this idea of cuts comes from? Did some mathematician in the fifties cut his finger while preparing lunch and suddenly became illuminated? Well, in fact, the idea of cutting planes is pretty central for convex analysis. Let’s thus make a small detour into the world of convex sets to see how cuts arise.

We will only go over the necessary results in this
post. For a more comprehensive overview of convex
geometry (or for actually learning it) I recommend the
great book by Stephen P. Boyd and Lieven
Vandenberghe^{1}.
Alright, time for some definitions.

A set X is
**convex** if, given two points x, y \in X, it also contains
the entire line segment between them.

Algebraically, for any \lambda \in [0, 1], \lambda x + (1-\lambda) y \in X.

My intuition for convex sets is that they are sets without holes nor wobbles. This encompasses a lot of the good ol’ shapes you are familiar with, such as planes, balls, and polyhedra.

In this post, we will also talk a lot about hyperplanes and their relation to convex sets. It is useful to think about a hyperplane interchangeably as a geometrical and an analytical object. We can view it as an affine function’s zero set

H = \{ x \mid \left\langle a,\,x\right\rangle + b = 0 \}

Or as the affine function itself. The important part is that a hyperplane divides the space in two halves, depending on whether the affine function is non-negative or not.

This division is specially important for convex sets, because, as a consequence of their regular shape, one can always pass a hyperplane between them in such a way that each lies in one of the halves.

For any pair of disjoint non-empty convex sets X, Y, there is a hyperplane dividing the space in a way that each one lies in one of its sides.

In other words, there exists an affine function \left\langle a,\,\cdot\right\rangle + b such that for all x \in X and y \in Y, \left\langle a,\,x\right\rangle + b \le 0 \quad \text{and}\quad \left\langle a,\,y\right\rangle + b \ge 0.

The proof to this result is more technical than
elucidating^{2}, so we
are going to skip over it on this post. You can find a
great presentation with all details on Boyd and
Vandenberghe^{3}’s
book. Fortunately, this is one of these results were we
can get a good intuition from a picture: due to their
absence of wobbles, one can pass a plane between two
convex sets without bumping into anything.

Have fun trying to drag the sets below to a position with no separating hyperplane.

Let’s turn our focus to the case where one of the sets is a single point. An interesting consequence of this theorem is that given any point p not in the set, we can find an affine function that is zero in this point and contains a convex set in one of its half-spaces. It is really similar to a cut, but for a set. For example, in the image below, you can choose a separating hyperplane between your mouse and the convex set in the middle.

Now, what happens when we choose our point to be in
the set’s boundary? In this case, we acquire a
*supporting hyperplane*. That is, a tangent
hyperplane that has the entire set in one of its
sides.

A **supporting hyperplane** at a point
x in the boundary of
X is a tangent
hyperplane that touches X at x and is non-negative at
every other point of X.

This is starting to look a bit like cuts, right? Very well, from the Separating Hyperplane Theorem, we can conclude a similar theorem saying that convex sets have supporting hyperplanes at all points in their boundary.

A convex set X has a supporting hyperplane for any point x_0 on its boundary. That is, there is an affine function \left\langle a,\,\cdot\right\rangle + b such that

\left\langle a,\,x_0\right\rangle + b = 0 \quad \text{and}\quad \forall x \in X,\, \left\langle a,\,x\right\rangle + b \ge 0.

If X has nonempty interior, there is a hyperplane separating \{x_0\} from the interior of X, and if it is empty, the entirety of X lies inside in an affine subspace of \mathbb{R}^n, and we can take this subspace as our hyperplane.

Notice that this theorem has the equivalent formulation as \forall x \in X,\, \left\langle a,\,x\right\rangle \ge \left\langle a,\,x_0\right\rangle.

## From convex sets to convex functions

After our detour on convex sets, it’s time to investigate how to convert these results about sets and hyperplanes into results about functions and cuts. After all, that is what we are interested in here.

Foremost, one remark: for simplicity of presentation
we will work with *extended real functions*.
These are functions that, besides the usual real
numbers, can evaluate to \infty. The advantage of this
approach is that we can just assume that all functions
are defined everywhere on \mathbb{R}^n. We call their
*domain* wherever they are finite and define them
as infinity elsewhere.

The most common definition in the wild for a
*convex function* is through *Jensen’s
inequality*,

f\left(\lambda x + (1 - \lambda) y\right) \le \lambda f(x) + (1 - \lambda) f(y).

Nonetheless, convex functions are notorious for having a multitude of equivalent definitions. Hence, for our purposes it will be better to choose a more geometric one.

A function is **convex** if its epigraph
is a convex set.

But what is this *epigraph* thing after all?
Well, everybody knows a function’s graph,

\operatorname{\mathrm{graph}}(f) = \left\{ (x, y) \in \mathbb{R}^{n+1} \mid f(x) = y \right\}.

The epigraph adjoins the suffix *epi-*, from
Greek ‘ἐπί’ meaning “above”
or “on top of”, to denote the set of all points
*above the graph*.

\operatorname{\mathrm{epi}}(f) = \left\{ (x, y) \in \mathbb{R}^{n+1} \mid f(x) \le y \right\}.

Besides sharing an etymology with *epic*, the
epigraph also lets us easily translate results from
convex sets to convex functions. For example, the
supporting hyperplane theorem translates into a
statement about tight cuts for convex function!

A convex function f has a tight cut at any point in its domain.

Visually this theorem looks like the figure below. You can hover it to view the cut for each point.

The graph of f lies in the boundary of its epigraph, which is convex. Therefore, from the supporting hyperplane theorem, for any x_0, the epigraph of f has a supporting hyperplane touching it at (x_0, f(x_0)). Since the tangent to a graph is non-vertical, the last component of this hyperplane cannot be zero. We can represent this supporting hyperplane as

\begin{aligned} \left\langle a,\,x\right\rangle + bf(x) &\ge \left\langle a,\,x_0\right\rangle + bf(x_0) \\ f(x) &\ge f(x_0) + \left\langle-\frac{1}b a,\,x - x_0\right\rangle. \end{aligned}

By defining \lambda = -\frac{{1}}{b} a, we arrive at

f(x) \ge f(x_0) + \left\langle\lambda,\,x - x_0\right\rangle.

With this theorem we finally achieved all we need to be able to calculate, not just postulate, cuts for a large set of functions.

### Cuts from derivatives

Cuts are a special case of tangent hyperplane which does not cross the functions graph anywhere. This means that they have everything to do with derivatives. Indeed, we just arrived into our first method for calculating cuts.

Suppose f is convex and differentiable at x_0. From the convexity it has a tight cut with inclination \mu

f(x) \ge f(x_0) + \left\langle\mu,\,x - x_0\right\rangle.

This cut is a tangent hyperplane at x_0, and, from differentiability, this hyperplane is unique with inclination given by the derivative. Therefore

\boxed{\forall x,\, f(x) \ge f(x_0) + \left\langle\operatorname{\mathrm{d}}\!{f}(x_0),\,x - x_0\right\rangle.}

Now that’s a formula! Recall our sandwich algorithm. There I asked you to take in good faith my words about our oracle, and now It’s finally time to pay my debt and turn that black box into a simple procedure.

All we need is a `derivative`

Julia
function that calculates the derivative of
`f`

at a point `x`

. There are lots
of packages which do that using either automatic or
numerical differentiation. I’ll just let you choose your
favorite. Our oracle then becomes a simple function that
assembles the cut.

```
function oracle(f, a)
= f(a)
fa = derivative(f, a)
dfx
return fa, Cut(a, fa, dfa) # Cut(x) = fa + dot(dfa, x - a)
end
```

### Cuts from Lagrangian duality

We have learned how to calculate cuts for a convex
function at any point of differentiability.
Nevertheless, the supporting hyperplane theorem
guarantees a cut at *every point* on the
function’s domain. How can we also calculate cuts where
f is not
differentiable?

If you are coming from an area of mathematics where everything is smooth, this last question may sound strange. Nevertheless, in operations research and optimization, it is rather common to stumble into non-smooth functions. Take the polyhedral functions, for example. They almost always have edges and corners, nonetheless, it should be easy to calculate cuts for them. They are made out of linear pieces, after all!

Our focus on this section will be on calculating cuts
for *optimal value functions*, that is, functions
defined as the optimal value of a parameterized
optimization problem, such as those one would encounter, for
example, in dynamic programming.

\begin{array}{rl} f(x) = \min\limits_{u} & c(u) \\ \textrm{s.t.} & (x, u) \in X. \end{array}

This last definition may, at first, seem too restrictive. Perhaps even more than requiring differentiability. There is a small trick, however, that lets us write any function as an optimal value function:

\begin{array}{rl} f(x) = \min\limits_{u} & f(u) \\ \textrm{s.t.} & u = x. \end{array}

If this feels like cheating, it’s because it certainly is! In practice, exchanging evaluation by optimization may be rather bad for performance. From a theoretical point-of-view, however, this representation always works and we are all safe.

The advantage of concentrating on optimal value
functions is that it opens to us a whole world of tools
from Optimization. In particular, we will see that Lagrangian
duality has everything to do with cuts.^{4}

#### Best cut given an inclination

Suppose you have a function f and an inclination \lambda. This inclination defines a whole family of cuts for f by varying the constant term in

f(x) \ge \underbrace{b}_\text{this term} + \left\langle\lambda,\, x - x_0\right\rangle.

Of course, as good optimizers that we are, we won’t be happy with just any cut. We want to find the best constant b possible given the inclination \lambda. And, just to keep things straight: best in this context means that b makes the cut tight.

Assuming a valid cut, we can quantify how close it is to f by calculating the minimum distance between both functions in their entire domain:

\min_y \left| f(y) - \big(b + \left\langle\lambda,\, y - x_0\right\rangle\big) \right|.

We can drop the absolute value in the distance above because valid cuts are everywhere below f, making the difference always non-negative. Also, for the cut to be tight, it must equal the function at a certain point and, consequently, have a zero gap. Let’s call this optimal b = L(x_0; \lambda) to mark its dependence on the parameters. This leaves us with

\begin{aligned} 0 &= \min_y f(y) - \big(L(x_0; \lambda) + \left\langle\lambda,\, y - x_0\right\rangle\big) \\ L(x_0; \lambda) &= \min_y f(y) + \left\langle\lambda,\, x_0 - y\right\rangle. \end{aligned}

Recall that we are considering f to be an optimal value function. Since L(x_0, \lambda) is also defined by an optimization problem, let’s unroll its definition into a single minimization.

L(x_0; \lambda) = \begin{array}{rl} \min\limits_{y, u} & c(u) + \left\langle\lambda,\, x_0 - y\right\rangle \\ \textrm{s.t.} & (y, u) \in X. \end{array}

Thus, we can turn an inclination into a tight cut by solving an optimization problem that is, in general, not much different from f itself. Also, the formula for the optimal constant that we just arrived is common enough in the literature to deserve its own definition.

The **Lagrangian relaxation** of an
optimal value function f is the optimal value
function
\begin{array}{rl}
L(x; \lambda) = \min\limits_{y, u} & c(u) +
\left\langle\lambda,\, x - y\right\rangle \\
\textrm{s.t.} & (y, u) \in X.
\end{array}

The previous definition is the basis for
*Lagrangian duality*, a tool known for its
applications in continuous optimization. In this
context, it is customary to call the inclination \lambda a *Lagrange
multiplier* The name “relaxation” comes from
interpreting the transition from f to L as relaxing the hard
constraint (x, u) \in X
into a linear penalty on how much the solution diverts
from the original feasible set.

The above means that the relaxation is an underapproximation to f at all points,

f(x) \ge L(x; \lambda).

Furthermore, it is itself affine. With respect to any point x_0, the relaxation is expressible as

L(x; \lambda) = L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle,

From the two properties above, we can deduce the already known fact that the Lagrangian relaxation defines a cut for f guaranteed to be tight:

\boxed{f(x) \ge L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle.}

Notice though that the cut above is not necessarily tight at x_0! The point of intersection between L and f depends on the multiplier \lambda.

#### Best cut given a point

By first fixing the inclination and only then solving
the Lagrangian relaxation, we lose control of where the
cut touches the original function. Hence, a natural
question to ask is, given a parameter x, what inclination yields
the tightest relaxation at this point? This is called
the *dual value function* and amounts to a
minimax problem:

\begin{aligned} \check{f}(x) &= \max\limits_{\lambda} L(x ; \lambda) \\ &= \max\limits_{\lambda} \min\limits_{(y, u) \in X} c(u) + \left\langle\lambda,\, x - y\right\rangle. \end{aligned}

That thing looks scary but, fortunately for us, all
solvers are pretty good at it. In fact, for the most
common types of optimization programs, the dual problem
has a known closed form and all state-of-the-art
algorithms in (continuous) optimization^{5} solve for the primal
and the dual problems at the same time. Who would say
that optimizing two function is easier than a single
one?

Since \check{f} is
the maximum of lower approximations to f, it is also a lower
approximation, a result known as *weak
duality*.

\boxed{f(x) \ge \check{f}(x)}

It is also guaranteed to be convex, for it is the maximum of affine functions. Did we just say it is a convex lower approximation constructed from cuts? It looks like we are getting to something.

Moreover, one can prove that the dual function is not just some ordinary convex function, but the tightest convex function everywhere below f. So the cuts for \check{f} are also the best ones possible for f at a certain point. The best part is that these cuts come for free from solving the optimization problem because they are formed by the dual value \check{f}(x) and its associated multiplier \lambda_x.

There is a cut for f at the point x_0 defined by the dual value \check{f}(x_0) and optimal dual multiplier \lambda_0. That is,

\forall x, f(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

The dual value function is the maximum among all Lagrangian relaxations. Let’s look at it centered at x_0.

\check{f}(x) = \max\limits_{\lambda} L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle.

The maximum above selects the best \lambda for the chosen parameter x. Therefore, its solution is above any other choice of \lambda we can make, including our desired \lambda_0.

\check{f}(x) \ge L(x_0; \lambda_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

Now the term L(x_0; \lambda_0) is the relaxation at x_0 with the optimal multiplier \lambda_0 at this point, which is precisely the definition of \check{f}(x_0). This yields a tight cut for the dual function.

\check{f}(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

From weak duality, it follows that it is also a cut for f(x).

f(x) \ge \check{f}(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

Perfect! We just discovered that the solutions to the dual problem are cuts for the primal problem. And, best of all, solving an optimization problem in general already provides us with the dual solutions, meaning that we get cuts for free. How cool is that?

Of course, now is the time for you to point your
finger towards me and say that there is no tightness
guarantee in the previous theorem. Although the dual is
the best relaxation possible, it could still be too
below the primal value function. But remember that we
discussed above that \check{f} is the
*tightest* convex function approximating f from below. Thus, they
should be equal whenever f is convex. This is an
important result called *strong duality*, which
states that \check{f}’s
epigraph is the closed convex hull of f’s epigraph.^{6} Therefore, now we
also know how to calculate tight cuts for the points of
non-differentiability of a convex function!

f\;\text{convex} \implies \forall x, f(x) \ge f(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

As a bonus, if you like automatic differentiation as much as I do, you also just learnt how to calculate the derivative for any function performing convex optimization during its execution. Suppose that f is differentiable at x_0. Since the derivative is unique and the cut provides us with a tangent, it follows that the derivative equals the optimal multiplier \nabla f(x_0) = \lambda_0. Thus, by solving a parameterized (convex) optimization problem, you gain both an evaluation and a derivative. Now all you have to do is plug it into the chain rule et voilà!

## Farewell

Today you gained a new tool to your repertoire of functional representations. We have learned how cutting planes are a computationally amenable structure that harmoniously integrate with convexity and optimization, and how we can use them to represent functions for minimization. And, best of all, if you are working with convex optimization or calculating any derivatives, you already have some cuts lying around for free!

I am also planning some further posts were we will explore how to combine cuts with value and policy iteration to extend dynamic programming to infinite dimensional state spaces, and how to put to good use the various available ways to calculate cuts for mixed integer programs.

Good convergence for y’all and stay tuned!

## Acknowledgements

I want to thank Ivani Ivanova and Pedro Xavier for commenting and proofreading this post’s first draft.

*Convex Optimization*(Cambridge, UK ; New York: Cambridge University Press, 2004), chap. 2, https://web.stanford.edu/~boyd/cvxbook/.↩︎For the Functional Analysts lurking around: in infinite dimension, it is equivalent to the Hahn-Banach Theorem.↩︎

*Convex Optimization*.↩︎For an overview of duality, you can (again) check Boyd and Vandenberghe, ibid. And for duality in the context of optimal value functions, there is also my M.Sc. thesis. (Iago Leal de Freitas, “Convexification by Averages” (Master’s thesis, Universidade Federal do Rio de Janeiro, 2019), https://iagoleal.com/data/mscthesis.pdf)↩︎

And consequently all free or commercial solvers in the market.↩︎

Technically, the theorem requires f to be a proper convex

*lower semicontinuous*function. But in practice, convexity tends to be the only part you need to worry about.↩︎