Cuts for Stochastic Programming

25 April 2024

Life is full of uncertainties and so are the problems one must face. When solving optimization problems with stochastic components, it is necessary to adapt your solution methods to prevent an explosion on the model’s size.

Let’s continue today with this blog series on cutting planes by taking them as our tool of choice for stochastic optimization. Since these problems can get huge pretty easily, we focus on the possible ways to represent the cuts during the solve and their consequences for parallelization.

By the way, the content about non-convex stochastic programs comes directly from my Master’s thesis, called Convexification by Averages.

Optimizing under an Uncertain Future

In many real-world problems, it’s necessary to take a decision incurring an immediate cost plus future consequences. Think about buying a car, for example: there is the payment at the buying time plus future costs such as maintenance, fuel or even purchase installments.

\begin{array}{rl} \min\limits_{z} & c_{\text{now}}(z) + c_{\text{future}}(z) \\ \textrm{s.t.} & z \in Z. \\ \end{array}

As the future is known to be an uncertain beast, we cannot be sure of how much of its costs. At best, we can estimate it using a random function c_{\text{future}}. Besides that, considering the decision-maker you are, let’s break the decision variable into a current decision x and a future decision y, which itself depends on x. This way, we’re better able to codify how different parts of the program interact.

\begin{array}{rl} \min\limits_{x, y} & c_{\text{now}}(x) + c_{\text{future}}(x, y) \\ \textrm{s.t.} & x \in X \\ & (x, y) \in Y. \end{array}

Thanks to our temporal considerations, the program above has a particular structure we’re able to exploit. Notice that the present only depends on x while the future depends on both variables.

Whenever the variable dependencies have this particular “L-shape”, it is possible to break the program into two interacting parts:

A first stage that is deterministic but whose total cost depends on the ensuing stage’s cost;
A second stage that is stochastic and whose parameters depend on the previous stage’s decision.

Notice that one needs to solve both stages together, because of their mutual dependence. By introducing a (random) optimal value function Q for the second stage, called its cost-to-go, our decision problem becomes

\begin{array}{rl} \min\limits_{x} & c_{\text{now}}(x) + Q(x) \\ \textrm{s.t.} & x \in X, \\ Q(x) = \min\limits_{y} & c_{\text{future}}(x, y) \\ \textrm{s.t.} & (x, y) \in Y. \end{array}

Also, the stochastic second stage is equivalent to a family of deterministic value functions, each with its decision variables y^s, where the distinct scenarios do not interact. Thus, besides its L-shape, the future dependency on y also breaks into small blocks corresponding to each scenario. This will be important in the ensuing sections to make the solution methods computationally tractable.

There is still one detail to sort out in this formulation: since the cost-to-go Q is stochastic, the total cost is also random. This is terrible for analysis, because, commonly, one wishes the final decision the model spits out to be bold and deterministic. Therefore, to properly solve the optimization problem, we need a way to aggregate the second stage’s cost into a single number. Since the best constant representing a random variable is its expected value, we take the average of all possible future outcomes to represent its cost.¹

\begin{array}{rl} \min\limits_{x} & c_{\text{now}}(x) + \mathbb{E}\left[ Q(x) \right] \\ \textrm{s.t.} & x \in X. \\ \end{array}

This setup configures a (two-stage) stochastic program and appears in a lot of industrial problems. Since it is the main subject we’re going to explore today, I think it deserves its own definition.

A stochastic program is a two-stage optimization problem with random future.

\begin{array}{rl} \min\limits_{x} & c_1(x) + \mathbb{E}\left[ Q(x) \right] \\ \textrm{s.t.} & x \in X \\ Q(x) = \min\limits_{y} & c_2(x, y) \\ \textrm{s.t.} & (x, y) \in Y. \end{array}

Stochastic Optimal Value Functions

Stochastic optimal value functions allow us to talk about, and solve, optimization problems with uncertain components. They take the form

\begin{array}{rl} Q(x) = \min\limits_{y} & c(y) \\ \textrm{s.t.} & (x, y) \in Y \end{array}

Where the objective c and the feasibility set Y are random. Leaving any concerns about measurability aside ², this defines a random variable for each input x and is, thus, what one calls a random function.

For finite uncertainty, a good intuition is to think of a random function as a collection of functions defined for each scenario.

As usual in probability, we treat random functions as a normal function but whose output may depend on some stochastic external state. Taking this view, we say that a random function has some property, such as being non-negative, linear, convex, monotone etc. whenever it has this property whatever the outcome.

For example, consider Q(x) = D_{6}\cdot x^2, where D_6 corresponds to throwing a 6-sided die. It is a convex, non-negative random function because these properties are valid no matter the scenario.

To make our notation more functional, we’re going to write \mathbb{E}\left[ Q \right] for the deterministic function defined pointwisely as the average of Q,

\mathbb{E}\left[ Q \right](x) = \mathbb{E}\left[ Q(x) \right].

These averages will appear a lot throughout this post, so it is good to get acquainted with them soon. For optimization, there is an important result — which we will use a lot — regarding the average: it preserves convexity.

The average \mathbb{E}\left[ Q \right] of a convex random function Q is also convex.

In the figure below you can see a convex random function (faded) together with its average (bold graph).

Convex Stochastic Programs

To start delving into the world of stochastic programs, let’s begin by the convex case. Hence, throughout this section we will always assume that the second stage is a random convex optimization problem. We need to put no restrictions on the first stage, since we will only need cuts for the future part. Now it’s time to approximate some random functions by cuts.

Thanks to convexity, we can use Lagrangian duality at each scenario to produce (random) tight cuts at any point of choice. Suppose we want to calculate our cut at x_0. It defines a random optimal value Q(x_0) but also a dual multiplier \Lambda dependent on the solution of Q(x_0). This produces the cut we want:

Q(x) \ge Q(x_0) + \left\langle\Lambda,\, x - x_0\right\rangle.

Since expected values preserve inequalities, this equation is all we need to approximate \mathbb{E}\left[ Q \right] by cuts.

If a random cut is tight at x_0 for Q, then its average is tight for the expected function \mathbb{E}\left[ Q \right],

\mathbb{E}\left[ Q \right](x) \ge \mathbb{E}\left[ Q \right](x_0) + \left\langle\mathbb{E}\left[ \Lambda \right],\, x - x_0\right\rangle.

Hover the figure below to calculate a cut for each scenario and an average cut.

The theorem above is the key for solving two-stage stochastic programs using cuts. When the uncertainty has a finite amount of scenarios, we can calculate the average cut as a two-step process:

For each scenario s, calculate a cut for the (deterministic) sample Q^s;
Combine the optimal values and multipliers into a cut for \mathbb{E}\left[ Q \right].

For infinite uncertainty, it is possible to use Sample Average Approximation or the law of large numbers to estimate the averages by independently sampling from Q. But this is out of scope for this post.

Single Cut Approach

Now that we know how to approximate the average of a random function, we can use it to construct an algorithm for solving stochastic programs. The idea is to adapt Kelley’s cutting plane algorithm for stochastic programs.

All algorithms we’re going to see in this post are variations on the same idea. We turn the recursive relation between present and future into an iterative algorithm, where the first stage sends its decision as a parameter to the second stage, which in counterpart sends cuts to improve the first stage’s view of the future. For convex programs, this process eventually converges towards a good enough approximation of the optimal decision.

Now, it’s time to get technical. Consider a pool of cuts \mathcal{C} for \mathbb{E}\left[ Q \right] that starts empty. From now on, we will call it the algorithms cut pool. This pool allows the construction a polyhedral underapproximation of \mathbb{E}\left[ Q \right] as

\mathfrak{Q}(x) = \max\limits_{(b, \lambda) \in \mathcal{C}} b + \left\langle\lambda,\, x\right\rangle.

This is called a single cut approximation for \mathbb{E}\left[ Q \right] because it is made only of cuts that directly approximate the average (But this nomenclature will make no difference until next section, so let’s go back to how to use our new friend to solve stochastic programs.) With \mathfrak{Q} at hand, we can construct an underapproximation for the total cost z^\star as

\begin{array}{rrl} z(\mathcal{C}) &= \min\limits_{x} & c_1(x) + \mathfrak{Q}(x) \\ &\textrm{s.t.} & x \in X \\ &= \min\limits_{x,t} & c_1(x) + t \\ &\textrm{s.t.} & x \in X, \\ & & t \ge b + \left\langle\lambda,\, x\right\rangle,\; \forall (\lambda, b) \in \mathcal{C}. \end{array}

We are substituting the complicated expected cost-to-go \mathbb{E}\left[ Q \right] by a bunch of linear constraints, making the problem much easier. Notice also that if the first stage was convex or an LP, it remains so, which is a big win in terms of plugging it into a solver.

If \mathfrak{Q} is a good approximation of Q, we expect z(\mathcal{C}) to be close to the real optimal z^\star. But what do we do if it isn’t? We get more cuts to improve it, of course! By using what we discussed in the previous section, we can calculate tight cuts for \mathbb{E}\left[ Q \right]. Whenever we solve the first stage, it gives us a solution x \in X that is feasible in the real problem. We can use it as the parameter to solve the deterministic optimization problem corresponding to each scenario,

\begin{array}{rl} Q^s(x) = \min\limits_{y} & c_2^s(y) \\ \textrm{s.t.} & (x, y) \in Y^s. \end{array}

The expectation of the optimal value and dual variable define a tight cut for \mathbb{E}\left[ Q \right], which we include into \mathcal{C}. In code, the procedures for calculating a new cut looks like this.

function average_cut(prog, x)
  v, y, λ = [], [], []   # value, solution, multiplier

  for s in prog.Scenarios
    v[s], y[s], λ[s] = solve(prog, s, x)
  end

  return Cut(mean(v), mean(λ), x)
end

To iteratively solve a stochastic program, the algorithm alternates between two steps, which correspond to how the information propagates between the stages:

Solve the first stage and propagate the primal solution x forwards to the second stage;
Solve the second stage for each scenario and propagate the average cut backwards to the first stage.

Graphically, we visualize this process as the decision tree below. The nodes are optimization problems and the edges represent their communication. We represent \mathfrak{Q} by a small diamond, which is the only thing the first stage’s node sees of the future.

We repeat these forwards-backwards steps until the algorithm converges to a solution for the original problem. The only thing remaining is a stopping criterion for us to know when the solution is good enough. For this, we use that we calculate both primal and dual solutions during the algorithm to estimate upper and lower bounds over the real optimal.

Lower bound: Since \mathfrak{Q}\le \mathbb{E}\left[ Q \right], the calculated first stage value is below the real optimal value: z(\mathcal{C}) \le z^\star.
Upper Bound: The calculated solutions x, y^s for each stage are feasible, thus their cost must be above the true minimum: c_1(x) + \mathbb{E}\left[ c_2(y) \right] \ge z^\star.

These bounds “sandwich” the true solution and, when they’re closer than a required tolerance, we can consider our approximation good enough. The code snippet below summarizes this whole procedure.

function solve_single_cut(stage1, stage2; tol = 1e-8)
  v2, y, λ = [], [], []   # Value, solution, multiplier
  ub, lb   = +Inf, -Inf

  while ub - lb > tol
    # First stage
    v1, x = solve(stage1)
    # Second stage
    for s in stage2.Scenarios
      v2[s], y[s], λ[s] = solve(stage2[s], x)
    end
    # Update approximation for E[Q]
    addcut!(stage1, Cut(mean(v2), mean(λ), x))
    # Check the bounds
    lb = opt1
    ub = stage1.cost(x) + mean(v2)
  end

  return x, y   # Calculated optima
end

A Note on Parallelism

Notice on the algorithm above that the second-stage scenarios are independent of each other. Since solving optimization problems is no simple task, one use this to distribute the work over multiple processors.

The algorithm is not completely parallel though, because the processors must synchronize their work to calculate the average cut. While you can solve |\mathcal{S}| optimization problems in parallel in the second stage, the process of gathering their results to produce a cut, updating the first-stage problem and solving it to get a new solution is synchronous. While only one processor runs this part, all others must wait.

In the next section we will learn a more flexible representation for the cost-to-go that, at the cost of more memory usage, overcomes this limitation.

Multicut Approach

In the single cut approach, we calculate one cut per iteration to directly approximate the cost-to-go \mathbb{E}\left[ Q \right]. Thus, in a sense, approximating the average is the second stage’s job: the first stage never sees any information regarding the different scenarios, it already comes aggregated to it.

Another possibility is to use a multicut approximation where we keep a pool of cuts \mathcal{C}^s for each scenario and construct separate approximations

\mathfrak{Q}^s(x) = \max\limits_{(b, \lambda) \in \mathcal{C}^s} b + \left\langle\lambda,\, x\right\rangle.

Now it is the first stage’s job to take these approximations and convert into an approximation to \mathbb{E}\left[ Q \right]. We do this by noticing that

\mathfrak{Q}^s \le Q^s \implies \sum_{s \in \mathcal{S}} p_s \mathfrak{Q}^s \le \mathbb{E}\left[ Q \right]

And plugging this whole average expression into our first stage approximation.

\begin{array}{rrl} z(\mathcal{C}) &= \min\limits_{x} & c_1(x) + \sum_{s \in \mathcal{S}} p_s \mathfrak{Q}^s(x) \\ &\textrm{s.t.} & x \in X \\ &= \min\limits_{x,t_s} & c_1(x) + \sum_{s \in \mathcal{S}} p_s t_s \\ &\textrm{s.t.} & x \in X, \\ & & t_s \ge b + \left\langle\lambda,\, x\right\rangle,\; \forall s \in \mathcal{S}, (\lambda, b) \in \mathcal{C}^s. \end{array}

The multicut approach defines a larger optimization problem than the one we had for the single cut, because it unlinks the cost-to-go for each scenario.

We can use this structure to build a similar algorithm to our previous one. The main difference is that the multicut solver adds the cuts immediately after completing each scenario.

function solve_multicut(stage1, stage2; tol = 1e-8)
  v2, y, λ = [], [], []   # Value, solution, multiplier
  ub, lb   = +Inf, -Inf

  while ub - lb > tol
    # First stage
    v1, x = solve(stage1)
    # Second stage
    for s in stage2.Scenarios
      v2[s], y[s], λ[s] = solve(stage2[s], x)
      # Update approximation for E[Q[s]]
      addcut!(stage1, s, Cut(v2[s], λ[s], x))
    end
    # Check the bounds
    lb = opt1
    ub = stage1.cost(x) + mean(v2)
  end

  return x, y   # Calculated optima
end

Flexibility and parallelism

An advantage of representing each scenario separately in the first-stage is the flexibility it gains us. Suppose, for example, that there is a scenario with a value function that’s more complicated than the others. In the single cut approach we have to solve all scenarios to obtain an average cut, and, therefore, cannot focus on the problematic one. With separate cut pools, on the other hand, it is possible to adapt the algorithm to sample more often the scenarios known to be harder.

This idea is particularly useful when parallelizing the algorithm because it doesn’t generate a synchronicity bottleneck for calculating average cuts. We could, for example, assign some scenarios for each worker and a processor responsible for solving the first stage from time to time. As soon as a worker produces a cut, it can send it over to the first stage to improve its approximation without affecting the other processors.

When the Future is Non-Convex

It is time to leave the peaceful and colorful land of convexity to enter the dark and haunted land of general, not necessarily convex, functions. Recall from the previous post that for an arbitrary deterministic function, the best cut we can get is only tight for its convex relaxation (dual function). As will see, this makes the story a bit more complicated.

First and foremost, we must modify our random cut inequality to take the dualization into account:³

Q(x) \ge \check{Q}(x_0) + \left\langle\Lambda,\, x - x_0\right\rangle.

By using the previous section’s approach, we decompose Q(x_0) in scenarios and calculate the average cut, which is an underapproximation for the expected function,

\mathbb{E}\left[ Q \right](x) \ge \mathbb{E}\left[ \check{Q} \right](x_0) + \left\langle\mathbb{E}\left[ \Lambda \right],\, x - x_0\right\rangle.

Despite being a valid cut, it is, in general, not tight — not even for the dual. As a matter of fact, it is not the best possible cut because by decomposing Q in scenarios, we ended losing information about how the non-convexities interact. The problem is that we end up calculating cuts for the average of duals \mathbb{E}\left[ \check{Q} \right] while the cuts we really want are for the dual of the averages \widecheck{\mathbb{E}\left[ Q \right]}. This remark is important enough to deserve its own lemma.

The average of convex relaxations is less than the average’s relaxation.

\mathbb{E}\left[ \check{Q} \right] \le \widecheck{\mathbb{E}\left[ Q \right]} \le \mathbb{E}\left[ Q \right].

\mathbb{E}\left[ \check{Q} \right] is a convex underapproximation of \mathbb{E}\left[ Q \right] because the expected value preserves inequalities and convexity:

Convex: Each \check{Q} is convex, and their average \mathbb{E}\left[ \check{Q} \right] is a non-negative linear combination of them;
Underapproximation: \check{Q} \le Q \implies \mathbb{E}\left[ \check{Q} \right] \le \mathbb{E}\left[ Q \right].

However, the convex relaxation \widecheck{\mathbb{E}\left[ Q \right]} is defined as the largest convex underapproximation of Q. Therefore, it is everywhere above \mathbb{E}\left[ \check{Q} \right].

We illustrate this theorem in the figure below, where you can see the ordering among the functions.

Despite being short, there is a lot to unpack in the theorem above.⁴ One interpretation of it — which we already remarked — is that each non-convex Q^s has rims and bumps in different points. By taking the average, these corners may cancel out leaving \mathbb{E}\left[ Q \right] closer to being convex than it would appear by looking at each scenario separately. In fact, in some special situations, the average could even be convex!

For solving optimization problems, the important part is the theorem’s consequence for cuts. To approximate the expected cost-to-go as tightly as possible, we need a way to directly approximate its convex relaxation.

Linked formulation

Both the single and multicut approaches, when applied to a non-convex second stage only yield cuts for \mathbb{E}\left[ \check{Q} \right], because they calculate an average cut by decomposing the problem into scenarios. To get tighter cuts, we need an approach that takes into account all cuts at the same time.

The idea is to build one huge optimization problem representing the entirety of \mathbb{E}\left[ Q \right] by linking all Q^s together. Start by recalling the definition of the problem for each scenario,

\begin{array}{rl} Q^s(x) = \min\limits_{y^s} & c_2^s(y^s) \\ \textrm{s.t.} & (x, y^s) \in Y^s, \end{array}

Where we purposefully write y^s for the decision variable to mark its dependence on the scenario. The expected cost-to-go is

\begin{aligned} \mathbb{E}\left[ Q \right](x) &= \sum_{s \in \mathcal{S}} p_s Q^s(x) \\ & = \begin{array}{rl} \sum\limits_{s \in \mathcal{S}} p_s \min\limits_{y^s} & c_2^s(y^s) \\ \textrm{s.t.} & (x, y^s) \in Y^s. \end{array} \end{aligned}

Since the constants p_i are non-negative and independent from the minimization, we can effectively “commute” the expectation with the minimum to arrive at an optimal value function.

\boxed{ \begin{array}{rl} \mathbb{E}\left[ Q \right](x) = \min\limits_{y^s} & \sum\limits_{s \in \mathcal{S}} p_s c_2^s(y^s) \\ \textrm{s.t.} & (x, y^s) \in Y^s. \end{array} }

This is the linked formulation for the expected value of an optimal value function. Since it is a huge non-convex program, rest assured that it can be slow as hell to solve. Nevertheless, one can use this formulation’s dual to calculate the tightest cut possible for \mathbb{E}\left[ Q \right], i.e., any standard method for tight cuts will return a multiplier \lambda satisfying⁵

\mathbb{E}\left[ Q \right](x) \ge \widecheck{\mathbb{E}\left[ Q \right]}(x_0) + \left\langle\lambda,\, x - x_0\right\rangle.

We can, therefore, use this cost-to-go formulation to build a stochastic programming algorithm. In comparison to the previous decomposed formulations, in this case we represent the second stage as a single node containing all scenarios.

From the first stage’s point of view, this is indistinguishable from a single cut calculation, because at each iteration we produce only one cut. We also need to adapt the stopping criterion because, for non-convex problems, there is no guarantee of convergence. There is hope to find a good solution after enough iterations, nonetheless.

function solve_linked(stage1, stage2; tol = 1e-8, max_iterations :: Int)
  ub, lb   = +Inf, -Inf
  iter     = 1

  while ub - lb > tol || iter > max_iterations
    # First stage
    v1, x = solve(stage1)
    # Second stage
    v2, ys, λ = solve(linked(stage2), x)   # Solve linked formulation for 2nd stage
    # Update approximation for E[Q]
    addcut!(stage1, Cut(v2, λ, x))
    # Convergence criteria
    lb    = opt1
    ub    = stage1.cost(x) + v2
    iter += 1
  end

  return x, ys  # Calculated optima
end

Notice from this code that we lost the embarrassingly parallel nature of the second stage. Although the linked formulation could gain from parallelism in the solver, there is no room for parallelizing directly in the scenarios. The second stage becomes a big opaque block.

Mixing and Matching Approaches

Before we close this section, it is worth comparing the linked versus the decomposed approach. As is common in optimization, when the problem at hand is not convex, we end up in a dilemma of speed against accuracy.⁶ Fortunately, methods never get jealous and one does not need to commit to any single one.

Let’s take a high-level look into how to combine our many approaches. First, remember that we aren’t really interested in the cuts themselves. They are only a tool towards our real objective: an (approximately) optimal solution (x, y). So, we can play a bit with how to represent the cost-to-go. An idea, thus, is to solve using a decomposed formulation but, once in a while — when the solution stagnates, for example — calculate a linked cut to improve our overall search space.

Another more sophisticated approach would be to take advantage of a parallel machine to calculate linked and decomposed cuts concurrently. It will need at least three workers: one for the first stage, one for the linked second stage, and the others for the scenarios.

First, notice that we can write the first stage in a way that accepts both single and multicuts. By denoting \mathcal{C}^s the cut pool for scenario s and \mathcal{C}^\mathbb{E} the pool of aggregated cuts, we represent the first stage as

\begin{array}{rl} \min\limits_{x,t, t_s} & c_1(x) + t \\ \textrm{s.t.} & x \in X, \\ & t = \sum_{s \in \mathcal{S}} p_s t_s \\ & t \ge b + \left\langle\lambda,\, x\right\rangle,\; \forall (\lambda, b)\in \mathcal{C}^\mathbb{E} \\ & t_s \ge b_s + \left\langle\lambda_s,\, x\right\rangle,\; \forall s \in \mathcal{S}, (\lambda_s, b_s) \in \mathcal{C}^s. \end{array}

The variable t represents the expected cost-to-go and, for consistency, is constrained to equal the average of the scenariowise cost-to-go. Since the second stage decisions do not communicate among themselves, the graph for this mixed approach can have separate nodes representing each cut method. In the figure, we illustrate an approach that concurrently calculates linked and multicuts.

This choice of aggregating with the expected value is called a risk-neutral formulation for a stochastic program and is, by far, the simplest and most common formulation you’ll find. Nevertheless, it is possible to use a risk-averse formulation by substituting the mean by any coherent risk measure. Everything we do in this post works the same for both formulations. We’re only sticking to the average because it is simpler to understand.↩︎
We will only work with finite uncertainties, so there shouldn’t be any difficulties.↩︎
We define the dual \check{Q} of a random function Q as its dual for each scenario: (\check{Q})^s = \widecheck{(Q^s)}. This is just a notation to make everything cleaner.↩︎
An abstract view of this theorem is that at each point x, the mapping f \mapsto \check{f}(x) is a concave function. What we just proved is the pointwise Jensen’s inequality for it.↩︎
For example, when the non-convexities arise from integer variables, you can use Lagrangian or Strengthened Benders cuts.↩︎
Much rather the decision of Benders vs Lagrangian cuts for MIP.↩︎