Approximation by a Thousand Cuts

An interactive guide to polyhedral representations

22 September 2023

As the late hours of the night embrace you, you find yourself seated at your workbench, the substantial dose of caffeine you recently consumed still coursing through your veins with unwavering energy. Then, in a sudden sparkle of inspiration, you stumble upon it–—the elusive function you so much needed. It’s a remarkable little function f : \mathbb{R}^n \to \mathbb{R}, that fits perfectly with all the simulations you intend to run.

You already envision it: the number crunching, the evaluations, the convergence. There is a problem, however. This function, you realize, is as real-valued as it gets, constructed with infinite precision, while all your code and simulations reside in the digital world of a computer’s memory, capable of storing only a finite number of bits. What can one do in such a situation?

Fortunately, mathematicians all over the world have already studied a myriad of methods for approximating functions in a computer. The answer depends, most of all, on what your objectives are, because each kind of approximation only preserves some characteristics, and is, thus, only appropriate for certain applications. If being accurate in some known points is your primary concern, polynomial interpolation may be your answer. On the other hand, if you worry about extrapolating into new datasets, a neural network might prove to be a better fit. Or if your aim is not evaluation but measurement through filters, projection into the Fourier domain or another functional basis could be the path to pursue.

In this post, we will explore approximations by cuts, a technique for representing functions that is suitable for optimization problems. This is a well-known tool within the Operations Research community and, I believe, is also full of potential in various other domains of engineering and mathematics.

Before we delve deeper into the world of cuts, you can stop and play with the interactive diagram below. Each click on a point in the first panel improves the approximation by cuts on the other one.

Approximation by cuts

Cutting planes have widespread usage on the Operations Research community because they play really well with optimization problems. Nevertheless, I unfortunately don’t see them being used quite as much in other branches of engineering or mathematics. Thus, before talking about why they are so cool, let’s first present tonight’s star.

A cut for f : \mathbb{R}^n \to \mathbb{R} is an affine function everywhere below x. That is,

f(x) \ge b + \left\langle\lambda,\, x\right\rangle.

In particular, we say that a cut is tight at x_0 if it equals the function at this point. In this case, it is usual to denote the cut as centered around x_0:

f(x) \ge f(x_0) + \left\langle\lambda,\, x - x_0\right\rangle.

In general, a single cut is a terrible approximation. But it is not its fault, after all, it is just a first order polynomial! The magic begins when you combine many cuts to represent f as a polyhedral function. Taking an approximation of f by cuts C = \{b + \left\langle\lambda,\,\cdot\right\rangle\} amounts to choosing the largest cut at each given point:

f(x) \approx \max\limits_{(\lambda, b) \in C} b + \left\langle\lambda,\,x\right\rangle.

This formula also explains why we refer to affine functions as “cuts”. Think about the graph of f. The graph of a cut is a hyperplane dividing the space in two halves: points above and below it. To form the polyhedral function, we slice the space, one cut at a time, in order to carve a polyhedron containing the graph of f.

Click anywhere in the figure below to carve the space until the function’s shape is recognizable. You start with no cut and, as you click, the polyhedral approximation improves until the function’s graph become recognizable. A fun game is to try to do this with the least amount of cuts possible.

Why I like cuts and you should too

What are the nice properties that an approximation by cuts has? Why is it used so much by the optimization and operations research folks? Let’s go through some of the reasons.

Underapproximation: Since all cuts are below f, their maximum also has to be. Thus, cutting planes are a good tool for estimating lower bounds, which is a necessity in optimization.
Easy to improve: Approximations by cuts can be refined without much hassle. Whenever you obtain a new cut \phi, all you have to do is update your bag of cuts from C to C \cup \{\phi\}, and it’s done. This allows for algorithms based on iterative refinement of the approximation.
Convexity: Every polyhedral function is convex, and convex functions are full of nice properties needed for optimization. Furthermore, one can prove that as you add more cuts any such approximation converges to the tightest convex function everywhere below f.

I also find it worth commenting that this property works as a double-edged sword: since polyhedral functions are always convex, you cannot guarantee a satisfactory approximation for a non-convex function. Keep this in mind when using cuts.
Linear Programming formulation: Minimizing a polyhedral function is among the easiest kinds of problems to optimize, because it can be represented as a Linear Program, So if f is some complicated convex function, but you already have a good approximation by cuts for it, you can minimize it instead and get an approximation for the optimum.

\begin{aligned} \min_x f(x) &\ge \begin{array}{rl} \min\limits_{x} &\max\limits_{(\lambda,b) \in C} b + \left\langle\lambda,\, x\right\rangle \\ \end{array} \\ &= \begin{array}{rl} \min\limits_{x, t} & t \\ \textrm{s.t.} & b + \left\langle\lambda,\,x\right\rangle \le t,\, \forall (\lambda,b) \in C \\ \end{array} \end{aligned}

These properties make cuts an excellent tool for estimating lower bounds on optimization problems, which is a necessity for any optimizer that “sandwiches” the optimal value between two bounds. Let’s now put all these properties to good use.

Example: Kelley’s cutting planes algorithm

Imagine that you must minimize a convex function, but, unfortunately, you only have access to a linear programming solver such a Gurobi, Xpress or GLPK. How do you do that?

Fortunately, one can minimize an arbitrary convex function by iteratively solving linear programs. For it to work, we will have to suppose (by now) that we have access to some oracle capable of both evaluating a function and calculating a tight cut for it. This is not cheating though, because in the section Cuts from Derivatives we will learn how to write such an oracle.

Besides the function f, the algorithm needs three more inputs: a tolerance \epsilon, an initial point (can be any point) to evaluate, and a uniform bound \forall x,\, M \le f(x). Since we are minimizing, this bound must exist for the problem to be well-posed. We begin with a polyhedral approximation \tilde{f}_0 which equals M everywhere. The idea is to, at each iteration, give the oracle a point x_n. This will return an evaluation f(x_n), which serves as an upper bound for \min f because all evaluations are larger than the minimum. You also get a tight cut c_n that can be used to improve your approximation to \tilde{f}_n. This way, we squeeze the optimal value between a sequence of bounds:

\min_n f(x_n) \ge \min_x f(x) \ge \min_x \tilde{f}_n(x).

The algorithm can also be more directly described in Julia code:

function minimize(f::Function, initial_bound, x = 0; tolerance)
  opt = Polyhedral(initial_bound)
  ub  = +Inf
  lb  = initial_bound

  while ub - lb > tolerance
    # Ask oracle for evaluation and cut at x
    fx, cut = oracle(f, x)

    # Improve upper bound
    ub = min(ub, fx)

    # Improve lower bound
    add_cut!(opt, cut)
    x, lb = solve_lp(opt)   # Also updates the best solution
  end

  return x, fx              # Optimal point and optimal value
end

In practice there are better ways to do unconstrained optimization, but this idea of sandwiching with evaluations and cuts works as a building block to more complex optimization algorithms such as Benders’ decomposition or Stochastic Dual Dynamic Programming.

A bit of convex geometry

So, where does this idea of cuts comes from? Did some mathematician in the fifties cut his finger while preparing lunch and suddenly became illuminated? Well, in fact, the idea of cutting planes is pretty central for convex analysis. Let’s thus make a small detour into the world of convex sets to see how cuts arise.

We will only go over the necessary results in this post. For a more comprehensive overview of convex geometry (or for actually learning it) I recommend the great book by Stephen P. Boyd and Lieven Vandenberghe¹. Alright, time for some definitions.

A set X is convex if, given two points x, y \in X, it also contains the entire line segment between them.

Algebraically, for any \lambda \in [0, 1], \lambda x + (1-\lambda) y \in X.

My intuition for convex sets is that they are sets without holes nor wobbles. This encompasses a lot of the good ol’ shapes you are familiar with, such as planes, balls, and polyhedra.

In this post, we will also talk a lot about hyperplanes and their relation to convex sets. It is useful to think about a hyperplane interchangeably as a geometrical and an analytical object. We can view it as an affine function’s zero set

H = \{ x \mid \left\langle a,\,x\right\rangle + b = 0 \}

Or as the affine function itself. The important part is that a hyperplane divides the space in two halves, depending on whether the affine function is non-negative or not.

This division is specially important for convex sets, because, as a consequence of their regular shape, one can always pass a hyperplane between them in such a way that each lies in one of the halves.

For any pair of disjoint non-empty convex sets X, Y, there is a hyperplane dividing the space in a way that each one lies in one of its sides.

In other words, there exists an affine function \left\langle a,\,\cdot\right\rangle + b such that for all x \in X and y \in Y, \left\langle a,\,x\right\rangle + b \le 0 \quad \text{and}\quad \left\langle a,\,y\right\rangle + b \ge 0.

The proof to this result is more technical than elucidating², so we are going to skip over it on this post. You can find a great presentation with all details on Boyd and Vandenberghe³’s book. Fortunately, this is one of these results were we can get a good intuition from a picture: due to their absence of wobbles, one can pass a plane between two convex sets without bumping into anything.

Have fun trying to drag the sets below to a position with no separating hyperplane.

Let’s turn our focus to the case where one of the sets is a single point. An interesting consequence of this theorem is that given any point p not in the set, we can find an affine function that is zero in this point and contains a convex set in one of its half-spaces. It is really similar to a cut, but for a set. For example, in the image below, you can choose a separating hyperplane between your mouse and the convex set in the middle.

Now, what happens when we choose our point to be in the set’s boundary? In this case, we acquire a supporting hyperplane. That is, a tangent hyperplane that has the entire set in one of its sides.

A supporting hyperplane at a point x in the boundary of X is a tangent hyperplane that touches X at x and is non-negative at every other point of X.

This is starting to look a bit like cuts, right? Very well, from the Separating Hyperplane Theorem, we can conclude a similar theorem saying that convex sets have supporting hyperplanes at all points in their boundary.

A convex set X has a supporting hyperplane for any point x_0 on its boundary. That is, there is an affine function \left\langle a,\,\cdot\right\rangle + b such that

\left\langle a,\,x_0\right\rangle + b = 0 \quad \text{and}\quad \forall x \in X,\, \left\langle a,\,x\right\rangle + b \ge 0.

If X has nonempty interior, there is a hyperplane separating \{x_0\} from the interior of X, and if it is empty, the entirety of X lies inside an affine subspace of \mathbb{R}^n, and we can take this subspace as our hyperplane.

Notice that this theorem has the equivalent formulation as \forall x \in X,\, \left\langle a,\,x\right\rangle \ge \left\langle a,\,x_0\right\rangle.

From convex sets to convex functions

After our detour on convex sets, it’s time to investigate how to convert these results about sets and hyperplanes into results about functions and cuts. After all, that is what we are interested in here.

Foremost, one remark: for simplicity of presentation we will work with extended real functions. These are functions that, besides the usual real numbers, can evaluate to \infty. The advantage of this approach is that we can just assume that all functions are defined everywhere on \mathbb{R}^n. We call their domain wherever they are finite and define them as infinity elsewhere.

The most common definition in the wild for a convex function is through Jensen’s inequality,

f\left(\lambda x + (1 - \lambda) y\right) \le \lambda f(x) + (1 - \lambda) f(y).

Nonetheless, convex functions are notorious for having a multitude of equivalent definitions. Hence, for our purposes it will be better to choose a more geometric one.

A function is convex if its epigraph is a convex set.

But what is this epigraph thing after all? Well, everybody knows a function’s graph,

\operatorname{\mathrm{graph}}(f) = \left\{ (x, y) \in \mathbb{R}^{n+1} \mid f(x) = y \right\}.

The epigraph adjoins the suffix epi-, from Greek ‘ἐπί’ meaning “above” or “on top of”, to denote the set of all points above the graph.

\operatorname{\mathrm{epi}}(f) = \left\{ (x, y) \in \mathbb{R}^{n+1} \mid f(x) \le y \right\}.

Besides sharing an etymology with epic, the epigraph also lets us easily translate results from convex sets to convex functions. For example, the supporting hyperplane theorem translates into a statement about tight cuts for convex function!

A convex function f has a tight cut at any point in its domain.

Visually this theorem looks like the figure below. You can hover it to view the cut for each point.

The graph of f lies in the boundary of its epigraph, which is convex. Therefore, from the supporting hyperplane theorem, for any x_0, the epigraph of f has a supporting hyperplane touching it at (x_0, f(x_0)). Since the tangent to a graph is non-vertical, the last component of this hyperplane cannot be zero. We can represent this supporting hyperplane as

\begin{aligned} \left\langle a,\,x\right\rangle + bf(x) &\ge \left\langle a,\,x_0\right\rangle + bf(x_0) \\ f(x) &\ge f(x_0) + \left\langle-\frac{1}b a,\,x - x_0\right\rangle. \end{aligned}

By defining \lambda = -\frac{{1}}{b} a, we arrive at

f(x) \ge f(x_0) + \left\langle\lambda,\,x - x_0\right\rangle.

With this theorem we finally achieved all we need to be able to calculate, not just postulate, cuts for a large set of functions.

Cuts from derivatives

Cuts are a special case of tangent hyperplane which does not cross the functions graph anywhere. This means that they have everything to do with derivatives. Indeed, we just arrived into our first method for calculating cuts.

Suppose f is convex and differentiable at x_0. From the convexity it has a tight cut with inclination \mu

f(x) \ge f(x_0) + \left\langle\mu,\,x - x_0\right\rangle.

This cut is a tangent hyperplane at x_0, and, from differentiability, this hyperplane is unique with inclination given by the derivative. Therefore

\boxed{\forall x,\, f(x) \ge f(x_0) + \left\langle\operatorname{\mathrm{d}}\!{f}(x_0),\,x - x_0\right\rangle.}

Now that’s a formula! Recall our sandwich algorithm There I asked you to take in good faith my words about our oracle, and now It’s finally time to pay my debt and turn that black box into a simple procedure.

All we need is a derivative Julia function that calculates the derivative of f at a point x. There are lots of packages which do that using either automatic or numerical differentiation. I’ll just let you choose your favorite. Our oracle then becomes a simple function that assembles the cut.

function oracle(f, a)
  fa  = f(a)
  dfx = derivative(f, a)

  return fa, Cut(a, fa, dfa)    # Cut(x) = fa + dot(dfa, x - a)
end

Cuts from Lagrangian duality

We have learned how to calculate cuts for a convex function at any point of differentiability. Nevertheless, the supporting hyperplane theorem guarantees a cut at every point on the function’s domain. How can we also calculate cuts where f is not differentiable?

If you are coming from an area of mathematics where everything is smooth, this last question may sound strange. Nevertheless, in operations research and optimization, it is rather common to stumble into non-smooth functions. Take the polyhedral functions, for example. They almost always have edges and corners, nonetheless, it should be easy to calculate cuts for them. They are made out of linear pieces, after all!

Our focus on this section will be on calculating cuts for optimal value functions, that is, functions defined as the optimal value of a parameterized optimization problem, such as those one would encounter, for example, in dynamic programming.

\begin{array}{rl} f(x) = \min\limits_{u} & c(u) \\ \textrm{s.t.} & (x, u) \in X. \end{array}

This last definition may, at first, seem too restrictive. Perhaps even more than requiring differentiability. There is a small trick, however, that lets us write any function as an optimal value function:

\begin{array}{rl} f(x) = \min\limits_{u} & f(u) \\ \textrm{s.t.} & u = x. \end{array}

If this feels like cheating, it’s because it certainly is! In practice, exchanging evaluation by optimization may be rather bad for performance. From a theoretical point-of-view, however, this representation always works and we are all safe.

The advantage of concentrating on optimal value functions is that it opens to us a whole world of tools from Optimization. In particular, we will see that Lagrangian duality has everything to do with cuts.⁴

Best cut given an inclination

Suppose you have a function f and an inclination \lambda. This inclination defines a whole family of cuts for f by varying the constant term in

f(x) \ge \underbrace{b}_\text{this term} + \left\langle\lambda,\, x - x_0\right\rangle.

Multiplier b = 0.

Of course, as good optimizers that we are, we won’t be happy with just any cut. We want to find the best constant b possible given the inclination \lambda. And, just to keep things straight: best in this context means that b makes the cut tight.

Assuming a valid cut, we can quantify how close it is to f by calculating the minimum distance between both functions in their entire domain:

\min_y \left| f(y) - \big(b + \left\langle\lambda,\, y - x_0\right\rangle\big) \right|.

We can drop the absolute value in the distance above because valid cuts are everywhere below f, making the difference always non-negative. Also, for the cut to be tight, it must equal the function at a certain point and, consequently, have a zero gap. Let’s call this optimal b = L(x_0; \lambda) to mark its dependence on the parameters. This leaves us with

\begin{aligned} 0 &= \min_y f(y) - \big(L(x_0; \lambda) + \left\langle\lambda,\, y - x_0\right\rangle\big) \\ L(x_0; \lambda) &= \min_y f(y) + \left\langle\lambda,\, x_0 - y\right\rangle. \end{aligned}

Recall that we are considering f to be an optimal value function. Since L(x_0, \lambda) is also defined by an optimization problem, let’s unroll its definition into a single minimization.

L(x_0; \lambda) = \begin{array}{rl} \min\limits_{y, u} & c(u) + \left\langle\lambda,\, x_0 - y\right\rangle \\ \textrm{s.t.} & (y, u) \in X. \end{array}

Thus, we can turn an inclination into a tight cut by solving an optimization problem that is, in general, not much different from f itself. Also, the formula for the optimal constant that we just arrived is common enough in the literature to deserve its own definition.

The Lagrangian relaxation of an optimal value function f is the optimal value function \begin{array}{rl} L(x; \lambda) = \min\limits_{y, u} & c(u) + \left\langle\lambda,\, x - y\right\rangle \\ \textrm{s.t.} & (y, u) \in X. \end{array}

Below you can view the Lagrangian relaxation for different multipliers \lambda. Multiplier: \lambda = 1.

The previous definition is the basis for Lagrangian duality, a tool known for its applications in continuous optimization. In this context, it is customary to call the inclination \lambda a Lagrange multiplier The name “relaxation” comes from interpreting the transition from f to L as relaxing the hard constraint (x, u) \in X into a linear penalty on how much the solution diverts from the original feasible set.

The above means that the relaxation is an underapproximation to f at all points,

f(x) \ge L(x; \lambda).

Furthermore, it is itself affine. With respect to any point x_0, the relaxation is expressible as

L(x; \lambda) = L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle,

From the two properties above, we can deduce the already known fact that the Lagrangian relaxation defines a cut for f guaranteed to be tight:

\boxed{f(x) \ge L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle.}

Notice though that the cut above is not necessarily tight at x_0! The point of intersection between L and f depends on the multiplier \lambda.

Best cut given a point

By first fixing the inclination and only then solving the Lagrangian relaxation, we lose control of where the cut touches the original function. Hence, a natural question to ask is, given a parameter x, what inclination yields the tightest relaxation at this point? This is called the dual value function and amounts to a minimax problem:

\begin{aligned} \check{f}(x) &= \max\limits_{\lambda} L(x ; \lambda) \\ &= \max\limits_{\lambda} \min\limits_{(y, u) \in X} c(u) + \left\langle\lambda,\, x - y\right\rangle. \end{aligned}

That thing looks scary but, fortunately for us, all solvers are pretty good at it. In fact, for the most common types of optimization programs, the dual problem has a known closed form and all state-of-the-art algorithms in (continuous) optimization⁵ solve for the primal and the dual problems at the same time. Who would say that optimizing two function is easier than a single one?

Since \check{f} is the maximum of lower approximations to f, it is also a lower approximation, a result known as weak duality.

\boxed{f(x) \ge \check{f}(x)}

It is also guaranteed to be convex, for it is the maximum of affine functions. Did we just say it is a convex lower approximation constructed from cuts? It looks like we are getting to something.

Moreover, one can prove that the dual function is not just some ordinary convex function, but the tightest convex function everywhere below f. So the cuts for \check{f} are also the best ones possible for f at a certain point. The best part is that these cuts come for free from solving the optimization problem because they are formed by the dual value \check{f}(x) and its associated multiplier \lambda_x.

There is a cut for f at the point x_0 defined by the dual value \check{f}(x_0) and optimal dual multiplier \lambda_0. That is,

\forall x, f(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

The dual value function is the maximum among all Lagrangian relaxations. Let’s look at it centered at x_0.

\check{f}(x) = \max\limits_{\lambda} L(x_0; \lambda) + \left\langle\lambda,\, x - x_0\right\rangle.

The maximum above selects the best \lambda for the chosen parameter x. Therefore, its solution is above any other choice of \lambda we can make, including our desired \lambda_0.

\check{f}(x) \ge L(x_0; \lambda_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

Now the term L(x_0; \lambda_0) is the relaxation at x_0 with the optimal multiplier \lambda_0 at this point, which is precisely the definition of \check{f}(x_0). This yields a tight cut for the dual function.

\check{f}(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

From weak duality, it follows that it is also a cut for f(x).

f(x) \ge \check{f}(x) \ge \check{f}(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

Perfect! We just discovered that the solutions to the dual problem are cuts for the primal problem. And, best of all, solving an optimization problem in general already provides us with the dual solutions, meaning that we get cuts for free. How cool is that?

Of course, now is the time for you to point your finger towards me and say that there is no tightness guarantee in the previous theorem. Although the dual is the best relaxation possible, it could still be too below the primal value function. But remember that we discussed above that \check{f} is the tightest convex function approximating f from below. Thus, they should be equal whenever f is convex. This is an important result called strong duality, which states that \check{f}’s epigraph is the closed convex hull of f’s epigraph.⁶ Therefore, now we also know how to calculate tight cuts for the points of non-differentiability of a convex function!

f\;\text{convex} \implies \forall x, f(x) \ge f(x_0) + \left\langle\lambda_0,\, x - x_0\right\rangle.

As a bonus, if you like automatic differentiation as much as I do, you also just learnt how to calculate the derivative for any function performing convex optimization during its execution. Suppose that f is differentiable at x_0. Since the derivative is unique and the cut provides us with a tangent, it follows that the derivative equals the optimal multiplier \nabla f(x_0) = \lambda_0. Thus, by solving a parameterized (convex) optimization problem, you gain both an evaluation and a derivative. Now all you have to do is plug it into the chain rule et voilà!

Farewell

Today you gained a new tool to your repertoire of functional representations. We have learned how cutting planes are a computationally amenable structure that harmoniously integrate with convexity and optimization, and how we can use them to represent functions for minimization. And, best of all, if you are working with convex optimization or calculating any derivatives, you already have some cuts lying around for free!

I am also planning some further posts were we will explore how to combine cuts with value and policy iteration to extend dynamic programming to infinite dimensional state spaces, and how to put to good use the various available ways to calculate cuts for mixed integer programs.

Good convergence for y’all and stay tuned!

Acknowledgements

I want to thank Ivani Ivanova and Pedro Xavier for commenting and proofreading this post’s first draft.

Convex Optimization (Cambridge, UK ; New York: Cambridge University Press, 2004), chap. 2, https://web.stanford.edu/~boyd/cvxbook/.↩︎
For the Functional Analysts lurking around: in infinite dimension, it is equivalent to the Hahn-Banach Theorem.↩︎
Convex Optimization.↩︎
For an overview of duality, you can (again) check Boyd and Vandenberghe, ibid. And for duality in the context of optimal value functions, there is also my M.Sc. thesis. (Iago Leal de Freitas, “Convexification by Averages” (Master’s thesis, Universidade Federal do Rio de Janeiro, 2019), https://iagoleal.com/data/mscthesis.pdf)↩︎
And consequently all free or commercial solvers in the market.↩︎
Technically, the theorem requires f to be a proper convex lower semicontinuous function. But in practice, convexity tends to be the only part you need to worry about.↩︎