A Uniform Probability over Probabilities

21 May 2024

Oftentimes, I need to sample points from a finite set. Usually, there is an obvious way to draw elements, but every so often I end up needing the probability distribution itself to be random (e.g., property testing for solvers that depend on weighted averages). To avoid possible statistical misfortunes, it is also desirable to choose such a distribution uniformly among all possibilities.

There are a few methods to do that. Today we’re going to explore my favorite one, whose only requirement is uniform sampling from the [0, 1] interval. Besides, it is pretty fast to compute and has an elegant geometric flavour. What else could we ask for?

What’s more, a lot of the lemmas we’ll use are L^1 versions of well-known results for Gaussians. Although their equivalence seems to be common knowledge among mathematicians, I don’t know of any reference that does things this way. So, as a bonus, now I have a place to link people to whenever I want to talk about Markov kernel-invariance. Oh well, it seems I’m putting the cart before the horse here. Time to start.

In case you’ve landed on this page in search of a sampling algorithm and don’t care for the mathematical intricacies — no matter how elegant they might be — here is the final theorem, which should have you covered.

Let U_i \sim \mathrm{Unif}[0, 1] be i.i.d. uniform random variables and define E_i = -\log(U_i) \sim \mathrm{Exp}(1). The barycentric projection

Z = \frac{1}{\sum_i E_i} E

is uniformly distributed over all N-element probability weights.
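In code, the whole recipe takes only a couple of lines. Here is a minimal sketch in Julia (the number of elements N is whatever your problem requires; a fuller implementation appears near the end of the post):

N = 5                  # number of elements in your set
E = -log.(rand(N))     # N i.i.d. Exp(1) variables via inverse transform sampling
Z = E ./ sum(E)        # uniformly distributed over the N-element probability weights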

Probabilities Over the Probabilities

To sample from a set, we must have a measurable structure. In our case, this means a probability over the probabilities. Since the probability distributions over a set with more than one element form an infinite set, we must ask ourselves whether our problem is even well-posed.

To answer this, let’s first reduce our problem to a more concrete one. For this, consider a finite set \Omega. All probabilities over it can be written as linear combinations of the point masses (Dirac deltas) on the elements of \Omega:

p = \sum_{\omega \in \Omega} p_\omega \cdot \delta_\omega(x).

Since p_\omega represents the probability of outcome \omega, not every linear combination of deltas is acceptable. To produce a probability distribution, the coefficients must form a convex combination, i.e.,

p_\omega \ge 0 \text{ and } \sum_{\omega \in \Omega} p_\omega = 1.

The set of all such coefficients is known as the standard simplex (from now on, just simplex for short). It is a compact convex subset of \R^{N} and, as such, has a clear notion of uniform probability inherited from the Lebesgue measure. Furthermore, since the set \Omega is fixed, sampling a random probability distribution is equivalent to sampling the coefficients. This way, our problem becomes geometrical: how to generate points uniformly in the simplex.
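As a concrete example, take \Omega = \{a, b, c\}. The distribution that assigns probabilities 0.2, 0.5, and 0.3 to these outcomes is

p = 0.2\,\delta_a + 0.5\,\delta_b + 0.3\,\delta_c \;\longleftrightarrow\; (0.2,\, 0.5,\, 0.3),

so drawing a random probability over \Omega amounts to drawing such a coefficient vector from the simplex.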

How to Sample From the Standard Simplex

Now forget all that abstract nonsense and let’s focus on subsets of \R^N. If you prefer to remain abstract, you can enumerate \Omega = \{\, \omega_1, \ldots, \omega_N \,\} and use the isomorphism \delta_{\omega_k} \mapsto e_k taking the point masses to the corresponding canonical basis vectors. Everything works the same.

Alright, we are interested in the set of all probability vectors in \R^N, called the standard simplex

\Delta^N = \left\{\, x \in \R^{N} \mid x \ge 0,\, \textstyle\sum_k x_k = 1 \,\right\}.


As a compact subset of \R^N, its normalized (hyper-)area element defines a uniform probability measure

p_{\Delta^N}(A) = \frac{\mathrm{Area}(A \cap \Delta^N)}{\mathrm{Area}(\Delta^N)}.

Our main goal in this section is to construct a way to sample from this probability using only the standard tools you’d find in any programming language, such as \mathrm{Unif}[0, 1] distributions. There are a couple of ways to do it, involving rejection sampling or some combinatorics with partitions of the interval. The method of choice for this post, however, uses symmetries and mimics the well-known construction for the uniform distribution on the sphere.
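For reference, the interval-partition idea can be sketched in a few lines of Julia: drop n - 1 cut points uniformly in [0, 1] and take the gaps between them (the function name is just for illustration; we won’t need it in what follows).

function sample_simplex_spacings(n)
  cuts = sort(rand(n - 1))       # n - 1 uniform cut points in [0, 1]
  return diff([0.0; cuts; 1.0])  # the n gaps are non-negative and sum to one
end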

Before getting technical, I think a sketch of the symmetry idea is due. We look at the non-negative cone \R^N_{\ge 0} as a stacking of simplexes r \Delta^N, each collecting the points whose components sum to r, in the same way as \R^N is a stacking of spheres of radius r.


This amounts to parametrizing the positive orthant in barycentric coordinates:

x = r \sigma,\;\text{ where }\; r = \textstyle\sum_k x_k,\, \sigma \in \Delta^N.
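For example, the point x = (1, 2, 1) lies in the simplex 4\Delta^3, since

x = (1, 2, 1) = 4 \cdot \left(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4}\right), \quad\text{so}\quad r = 4,\; \sigma = \left(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4}\right) \in \Delta^3.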

To produce a distribution on \Delta^N, take a non-negative random vector X and scale it such that its components sum to one (barycentric projection).


By choosing an X whose distribution is invariant with respect to symmetries of the simplex, the projection’s distribution can be made uniform. Now that you’ve (hopefully) got some intuition about the plan, it’s time to go on and prove the necessary theorems.

Symmetry and Invariance

Since we talked about symmetries, the obvious next step would be taking a look at the group of linear automorphisms that preserve the simplex. Unfortunately, it is too meager: only permutation matrices and nothing more. In other words, its group of (linear) symmetries is just the — no pun intended — symmetric group on its vertices.

\mathrm{LinAut}(\Delta^N) \cong S_N.

We would prefer a continuous group for our symmetries. Nevertheless, not all is lost! The thing is: a full-on group was just too much to ask for. Instead, the monoid of linear transformations that preserve the simplex will serve our needs much better than that symmetry group.

A linear transformation M preserves a subset A \subset \R^N_{\ge 0} if it takes elements of A to elements of A,

x \in A \implies M x \in A.

In the case of the simplex, the definition above simplifies to a finite number of conditions, because linear maps preserve convex combinations. Since the simplex is the convex hull of the canonical basis vectors, we only need to impose that their images do not leave the set to guarantee that a map preserves the simplex.

\boxed{M e_k \in \Delta^N,\, k = 1,\ldots, N.}


The above is the abstract version of our condition. It takes a more concrete shape by noticing that M e_k is exactly the k-th column of M.

M = \left[ \begin{array}{ccc} \rvert & & \rvert\\ M e_1 & \cdots & M e_N \\ \rvert & & \rvert \end{array} \right].

Consequently, the matrices that preserve the simplex are those whose columns are themselves probability vectors.

If you’ve ever tinkered with Markov Processes, you may know those by the name of stochastic matrix or Markov Kernel, and perhaps even used them for the exact same reason: preserving the total probability in a transition step. Also, they’re the reason why my friends who decided to study probability to run away from algebra ended up needing to learn about semigroups and monoids. Even though I’m a fan of saying “Markov Kernels”, in this post we’ll go with the name “stochastic matrices” 1.

A matrix is stochastic if its columns are probability vectors (elements of \Delta^N). We denote by \mathbf{Sto}(N) the monoid of all such matrices.
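In code, checking these two conditions is a one-liner (a small illustrative helper, not something we will need later):

# A matrix is (column-)stochastic when its entries are non-negative
# and every column sums to one.
is_stochastic(M) = all(M .>= 0) && all(isapprox.(sum(M, dims=1), 1))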

Throughout the realms of mathematics, symmetries are known for producing invariants because, by preserving a set, they end up preserving some simple function related to it. Rotations conserve inner products and lengths, translations preserve the differences, etc. So, what is the invariant associated with stochastic matrices? As you might expect from the discussion before, they conserve the sum of components.

\sum_k (M x)_k = \sum_{k,l} M_{kl} x_l = \sum_l x_l \underbrace{\left( \sum_k M_{kl} \right)}_{= 1} = \sum_l x_l.

On top of this algebraic derivation, there is also a more geometrical point of view. Let’s use the barycentric coordinates from the previous section to write x = r \sigma, with r = \sum_k x_k and \sigma \in \Delta^N. By linearity, the coordinates of M x are

Mx = M(r\sigma) = r \underbrace{(M \sigma)}_{\in \Delta^N}.


Thus, a stochastic matrix alters the coordinates within the standard simplex but does not change which simplex r \Delta^N the vector lies in. It’s similar to how rotations alter directions but keep a vector on the same concentric sphere.
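A quick numerical sanity check of this invariant, building a stochastic matrix by normalizing the columns of a random positive matrix:

B = rand(4, 4)
M = B ./ sum(B, dims=1)   # normalize columns, so M is stochastic
x = rand(4)
sum(M * x) ≈ sum(x)       # true: M moves x around but preserves the sum of its components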

Something cool about invariants is that we’re able to transfer them from points to functions. In general, if a function is G-invariant, i.e.,

\forall M \in G,\, f(M x) = f(x),

one expects it to only depend on the quantities conserved by G. In our case of interest, \mathbf{Sto}(N), we can rigorously prove that only the sum matters for such functions.

A function f : \R^N_{\ge 0} \to \R is \mathbf{Sto}(N)-invariant if and only if it only depends on the sum of the input’s components. That is, there is a \phi : \R_{\ge 0} \to \R such that

f(x) = \phi\left(\textstyle\sum\nolimits_k x_k\right).

f depends only on \textstyle\sum_k x_k \implies f is \mathbf{Sto}(N)-invariant:

The function f cannot “see” the changes M \in \mathbf{Sto}(N) produces:

f(M x) = \phi(\textstyle\sum_k (M x)_k) = \phi(\textstyle\sum_k x_k) = f(x).

f is \mathbf{Sto}(N)-invariant \implies f depends only on \textstyle\sum_k x_k:

We use the invariance to turn f into a one-dimensional function by constructing a stochastic matrix A that collapses each simplex onto a single coordinate axis. You can think of this procedure as a 1-norm version of rotating the coordinate system so that x aligns with an axis. Concretely, we achieve this by defining A e_k = e_1, which amounts to a matrix whose first row is all ones and whose other rows are identically zero.

A = \left[ \begin{array}{ccc} 1 & \cdots & 1 \\ 0 & \cdots & 0 \\ \vdots & \vdots & \vdots \\ 0 & \cdots & 0 \end{array} \right]

This stochastic matrix accumulates the sum into the first component.

A x = \left[ \begin{array}{ccc} 1 & \cdots & 1 \\ 0 & \cdots & 0 \\ \vdots & \vdots & \vdots \\ 0 & \cdots & 0 \end{array} \right] \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_N \end{array} \right] = \left[ \begin{array}{c} \sum_k x_k \\ 0 \\ \vdots \\ 0 \end{array} \right]

By invariance,

f(x) = f(A x) = f\left( \left(\textstyle\sum_k x_k\right) \cdot e_1\right) = \phi\left(\textstyle\sum_k x_k\right),

where we define \phi(t) = f(t \cdot e_1).

This way, we see that being \mathbf{Sto}(N)-invariant constrains a function to a rather simple form: it is as if the function could only depend on a single axis. As we will shortly see, this restriction (being constant on each simplex r \Delta^N) also greatly simplifies the possible invariant random variables.

A Concrete Sampling Algorithm

Now that we’re acquainted with stochastic matrices and their invariants, it’s time to go back to probabilities and distributions. When a random vector has a \mathbf{Sto}(N)-invariant distribution, its density is constant on each scaled simplex r \Delta^N. Hence, we can project it back onto the standard simplex to get a uniform distribution.

Let E be an almost surely positive random vector whose distribution is continuous and \mathbf{Sto}(N)-invariant. The barycentric projection of E is uniformly distributed on the standard simplex,

\frac{E}{\sum_k E_k} \sim \mathrm{Unif}(\Delta^N).

The components being positive guarantees that \sum_k E_k is also almost surely positive and the projection Z = E / \sum_k E_k is well-defined. We know that Z is supported on the simplex because it is normalized. Let’s use the invariance to show that it is uniformly distributed.

Consider a region A \subset \Delta^N. The projection is in A if and only if the original variable E lies in the cone C_A of rays coming from the origin and crossing A somewhere,

\begin{aligned} \mathbf{Prob}\left(\frac{E}{\textstyle\sum_k E_k} \in A\right) &= \mathbf{Prob}\left(E \in \left\{\, r \sigma \mid r \in (0, +\infty), \sigma \in A \,\right\} \right). \end{aligned}


The distribution of E is continuous, so calculating the probability on the right amounts to your commonplace integral. Also, from invariance, its density f_E only depends on the sum of components (i.e., f_E(x) = \phi(\textstyle\sum_k x_k)). Thus, a change of variables for barycentric coordinates x = (\textstyle\sum_k x_k)\sigma seems like a good bet.

\begin{aligned} \mathbf{Prob}(Z \in A) &= \int_{C_A} \phi(\textstyle\sum_k x_k)\, dx = \int_0^\infty \int_A \phi(r) r^{N-1}\, d\sigma\, dr \\ &= \mathrm{Area}(A) \underbrace{\int_0^\infty \phi(r) r^{N-1}\, dr}_{\text{constant}} \end{aligned}

Consequently, the distribution of E / \textstyle\sum_k E_k is a multiple of the area element on the simplex, defining a uniform measure.

The previous theorem is a recipe for turning \mathbf{Sto}(N)-invariant distributions into ones that are uniform on the simplex. The only thing missing is to find a suitable distribution to project. What could it be? As it turns out, a vector of i.i.d. exponentially distributed random variables2 just cuts it. I don’t want to just postulate it, however. We’ve come this far from first principles, so let’s make the exponentials appear by themselves.

Any invariant distribution will do, so let’s go with the easiest kind: those with independent components. What I find the most impressive is that the constraints of \mathbf{Sto}(N)-invariance and independence together are strong enough to characterize exponential distributions. You can think of it as an adaptation of Maxwell’s theorem3 for the simplex.

The only \mathbf{Sto}(N)-invariant absolutely continuous distributions on \R^N_{\ge 0} (the non-negative cone) with independent components are identically distributed exponentials \sim \mathrm{Exp}(\lambda).

We only consider absolutely continuous distributions supported on the positive orthant. Hence, the distribution equals p_Z = f \cdot \mathbb{I}_{\ge 0} for an integrable probability density f. Let’s investigate this function. By \mathbf{Sto}(N)-invariance, this density only depends on the sum of components and, by independence, the joint density is a product of single-variable densities f_i,

f(x) = \phi\left(\textstyle\sum\nolimits_k x_k\right) = \prod_k f_k(x_k).

Let’s turn it into a system of differential equations and solve for a closed form. As we are working with distributions, there’s no need to worry about smoothness right now because all derivatives can be taken in a weak sense.

\partial_i f(x) = \phi'\left(\textstyle\sum\nolimits_k x_k\right) = f_i'(x_i) \prod_{k \ne i} f_k(x_k).

By dividing both sides by f, it becomes

\frac{\phi'\left(\textstyle\sum\nolimits_k x_k\right)}{\phi\left(\textstyle\sum\nolimits_k x_k\right)} = \frac{f_i'(x_i)}{f_i(x_i)}.

The above works independently of i, meaning that the fractions f_i'(x_i) / f_i(x_i) are all equal. However, each depends on a different variable. The only way this can happen is if they all equal the same constant. Let’s smartly call this constant -\lambda.

\frac{f_i'}{f_i} = -\lambda \implies f_i' = -\lambda f_i.

Great! As the only weak solutions to a linear ordinary differential equation are the classical ones, the above proves that the densities are exponentials.

\boxed{f_i(x_i) = C_i e^{-\lambda x_i}}.

You can use the fact that the f_i are probability densities to deduce that \lambda > 0 and C_i = \lambda are the only admissible constants: their tails have to go to zero and they must integrate to one.

\begin{array}{ll} \lim_{t \to \infty} C_i e^{-\lambda t} = 0 &\iff \lambda > 0, \\ 1 = \int_0^\infty C_i e^{-\lambda t} dt = \frac{C_i}{\lambda} &\iff C_i = \lambda. \end{array}

Therefore, for non-negative arguments, the joint density has the form

p_Z(x) = \lambda^N e^{-\lambda \sum_k x_k} \cdot \mathbb{I}_{\ge 0}(x),

which characterizes a multivariate exponential with i.i.d. components.

Among its many properties, the exponential distribution is famously simple to generate from a uniform distribution on [0, 1], as a consequence of the inverse transform sampling:

U \sim \mathrm{Unif}[0, 1] \implies -\frac{1}{\lambda}\log(U) \sim \mathrm{Exp}(\lambda).

Combining the above with this section’s theorems, we arrive at a way to sample uniformly from the simplex using only \mathrm{Unif}[0, 1] distributions — which are available in any programming language worth its salt. The function below illustrates how simple an implementation can be. To be fair, I could’ve even made it a one-liner, but it’s spelt out for legibility.

function sample_simplex(n)
  U = rand(n)              # Uniformly distributed on n-cube
  E = map(u -> -log(u), U) # E[i] ~ Exp(1)

  return E / sum(E)        # Uniformly distributed on simplex
end
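As a quick sanity check, the output is non-negative and sums to one:

Z = sample_simplex(4)
all(Z .>= 0) && sum(Z) ≈ 1   # true: Z lies in the standard simplex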

It’s a single step to go from the previous method to uniformly sampling a probability distribution over a finite set. Assuming you have a type Dist for distributions, it’s just a mapping.

function sample_prob(xs)
  P = sample_simplex(length(xs))
  return Dist(xs[i] => P[i] for i in eachindex(xs))
end

Bonus: Using the Simplex to Sample Other Sets

The previous section already wrapped up our main goals. Nevertheless, it would be a shame to develop such a fine tool to simply finish the post and go home. Let’s thus explore a couple examples of using \Delta^N to uniformly sample from other interesting sets. By the way, this is a bonus section, so feel free to ignore it and just call it a day.

Stochastic Matrices

Stochastic matrices helped us a lot during this post, especially as a tool to understand the invariants related to the simplex. Let’s switch our view during this section to put them in the spotlight. I think they deserve it.

How do we sample uniformly from \mathbf{Sto}(N)? Well, the only requirement is that each column must be a probability vector — an element of the simplex. Thus, the stochastic matrices amount to N independent copies of the simplex,

\mathbf{Sto}(N) \cong \prod_{i = 1}^N \Delta^N.

Drawing each column independently from the simplex therefore samples uniformly from \mathbf{Sto}(N).

function sample_sto(n)
  # Sample n vectors uniformly from the simplex
  Ps = (sample_simplex(n) for _ in 1:n)
  # Concatenate the vectors as the columns of a matrix
  return reduce(hcat, Ps)
end

1-norm unit sphere

This whole post is deeply linked to the 1-norm

\left\lVert x\right\rVert_1 = \sum_k \left|x_k\right|.

The 1-norm unit sphere is the set of all points whose 1-norm equals 1,

\partial B^N_1 = \{\, x \in \R^N \mid \left\lVert x\right\rVert_1 = 1 \,\}.

This is a collage of identical simplexes, one for each orthant. As their intersections have zero measure, you sample uniformly from the sphere by drawing a point in the simplex and flipping N coins to decide each component’s sign.

function sample_1_sphere(n)
  Z = sample_simplex(n)    # Point in the simplex
  S = rand([-1, 1], n)     # n choices of positive/negative

  return Z .* S            # Pointwise product
end

Triangulated Regions and Polyhedra

Triangulations are inescapable in geometry: from algebraic topology to computer graphics, you will eventually bump into them. They also help us sample from polyhedra — even when non-convex.

To put it simply, a triangulation of a compact K \subset \R^N is a finite selection of simplices \mathcal{T} such that their union covers K and all pairwise intersections have dimension less than that of K (and therefore zero measure).4

The method in this section is an evolution of the one for the 1-sphere. First, choose a triangulation of the polyhedron K. Then, pick a simplex from the triangulation with probability proportional to its volume and sample uniformly from that simplex; the result is uniformly distributed on the whole polyhedron.

A non-standard simplex S is the set of convex combinations of affinely independent vectors y_i,

S = \{\, x \mid \exists p \in \Delta^N, x = \sum_i p_i y_i \,\}.

You can generate it from a standard simplex by mapping the canonical basis to the vectors spanning S. From our previous discussions, you already know this map is a matrix Y whose columns are the vectors y_i. Then the variable YP is uniformly distributed on S whenever P \sim \mathrm{Unif}(\Delta^N). With this variable and the volume weighting just discussed, we get a uniform distribution on the whole polyhedron.

function sample_polyhedron(K)
  T  = triangulate(K)             # Assuming you have a method to do it
  ws = Dist(t => volume(t) / volume(K) for t in T)
  Y  = sample(ws)                 # Random simplex from the triangulation
  P  = sample_simplex(dim(K) + 1) # A dim(K)-dimensional simplex has dim(K) + 1 vertices

  return matrix(Y)*P              # Y*P = Σ P_i Y_i
end

Conclusion

Very well, today’s journey has come to an end. I hope you had fun exploring how a seemingly strange problem in statistics could be reduced to a geometrical problem about symmetries. I know that I did.

To wrap up, I should point out the similarity between today’s derivation and the Herschel-Maxwell theorem for rotation-invariant distributions, in particular because the sum equals the L^1 norm on the non-negative orthant. Does this mean something deeper? I don’t really know, but I bet so. What I know is that one can sample from any p-norm sphere5 using a distribution proportional to e^{-\left\lVert x\right\rVert_p^p}. The problem is that finding good transformations preserving general p-norms is not so direct. To be fair, I can’t think of anything but permutations. That’s a topic for a future post, perhaps.

Farewell and good sampling for y’all!

Acknowledgements

Many thanks to Ivani Ivanova for her comments and effort proofreading this post.


  1. In fact, people in probability generally work with left stochastic matrices, because they like multiplying vectors from the left, i.e. p M = p, and thus require the rows to be probability vectors. The matrices in this post are right stochastic because we multiply vectors like normal people. No confusion should arise, because in this post we’ll only use the right kind.↩︎

  2. Continuous random variables with density f(x) = \lambda e^{-\lambda x} \mathbb{I}_{[0, \infty)}(x).↩︎

  3. See this link for the deduction of Maxwell’s theorem that inspired this proof.↩︎

  4. There are more details, such as requiring that the intersections be themselves simplices, but we’re already going too far outside the scope of this post.↩︎

  5. Franck Barthe et al., “A Probabilistic Approach to the Geometry of the ℓpn-Ball,” The Annals of Probability 33, no. 2 (March 2005), https://doi.org/10.1214/009117904000000874.↩︎
