Shower Thoughts about Averages

29 December 2021

I have a confession to make: I never really quite grasped how to think about expected values. Of course, I use it a lot (really really a lot) and I know pretty well how useful it is, but I never had a good intuition for it. In fact, I always thought that its importance was purely operational: the average is easy to calculate, has plenty of applications and there are a lot of powerful theorems related to it.

Nevertheless, this all changed earlier this year. I was taking a shower1, not even thinking about math when I was suddenly struck by a bliss of clarity: the average minimizes the squared error—I said to myself. This seemly simple statement is present (normally as an exercise) in practically any introductory probability book, but I had never realized that it implies so much! From this, we can construct the average not by a smart formula but geometrically using a variational principle.

Let’s look at it from an information point of view: suppose you are working with a random variable X and you don’t have where to store all its information. In fact, for the sake of this example, you can only store a single number to represent it. Which number should you choose? You probably already know it is \mathbb{E}[X]… But why? Well… Because it is in fact the best possible approximation. This post will be a digression on this theme:

The closest constant to a random variable is its expected value.

Of course, there are some things lacking on the theorem above. How do you measure the distance between random variables? Moreover, constants are numbers and random variables are functions. They have different types! Then, what does it mean to them to be close together? The usual probabilistic view doesn’t emphasize this, but we can interpret random variables geometrically as points in space and, then, measuring their distance becomes as simple as taking the length of the line segment connecting them. Indeed, if we know the random variables, we can even measure this length through nothing more, nothing less than the Pythagorean Theorem!

The Geometry of Random Variables

Let’s begin by laying the groundwork. Suppose that we’re considering an event with a finite amount (let’s say N) of possible outcomes \Omega = \{\omega_1,\ldots,\omega_N\}, each occurring with a probability p_i. You can think of it as throwing a die, pulling a slot machine’s lever or any other situation where uncertain stuff may happen.

In this framework, we can think of a random variable as a function that for each possible outcome assigns a real number2, X \colon \Omega \to \mathbb{R}. As an example, a possible random variable is how much cash each outcome of a slot machine gives you. Now comes the most important part: these are just real-valued functions, thus summing them or scaling by a number always amounts to another random variable:

\begin{aligned} (X + Y)(\omega) &= X(\omega) + Y(\omega) \\ (kX)(\omega) &= k \cdot X(\omega) \end{aligned}

This means that our random variables form a Vector Space! For us, the beauty of this is that we can use the tools of good ol’ Euclidean Geometry to understand how they work. And I don’t know for you, but I always feel more at home when working in the land of Linear Algebra.

A nice consequence of having only a finite amount of possible outcomes is that the space of random variables is actually finite-dimensional! The indicator functions of the outcomes form a basis to this space:

\mathbb{I}_i(\omega_j) = \begin{cases} 1,& i = j \\ 0,& \text{otherwise}. \end{cases}

The coefficients of a random variable in this basis are also pretty straightforward. They are just the values it returns for each outcome. That is, if X \colon \Omega \to \mathbb{R} returns the value x_i when we sample an outcome w_i, then it is decomposable as

X = \sum_{j = 1}^N x_j \mathbb{I}_j.

In two dimensions, we can also visualize this as a figure:

Because we are interested in a geometrical definition for the mean, let’s consider a special example: random variables that always return the same value no matter what outcome we sample. These are called constant random variables, and as you can image from the introduction, are of uttermost importance for what we are doing in this post. So, if X \colon \Omega \to \mathbb{R} always returns c, it can be decomposed as

C = \sum_{j = 1}^N c \mathbb{I}_j = \underbrace{c}_{\text{constant}} \cdot \underbrace{\left(\sum_{j=1}^N \mathbb{I}_j\right)}_{\text{direction}}.

Or, as a figure:

From this we see that all constants lie on the same line (one-dimensional space), given by the diagonal of all indicators. Remember that we want a method to find the closest constant to a random variable. Intuitively, we can think of this procedure as projecting the random variable into the line of constants. So, let’s proceed by considering how to project vectors into subspaces.

Where are all the probabilities?

Possibly you noticed that until now, we never used the probabilities. The random variables represent values attached to some non-deterministic outcome however we haven’t used any notion of which outcomes are more probable nor how they relate to each other.

Another thing you may have noticed is that the previous section was all function algebra, without any real geometry happening. There were no angles, distances nor anything like that. Very well, this is the time to fix things up by killing two birds with one stone. Our solution will use the probabilities of the outcomes to define an inner product on the space of random variables.

Now it’s modeling time! How can we embed the probabilistic structure into an inner product? To have an inner product that somehow reflects the probabilities, we will ask that it satisfies some coherence conditions.

We want the inner product of two random variables X and Y to only depend on the probability p_i if both of them are non-zero for the outcome \omega_i. This restriction represents that the information for one possible outcome is only important for this outcome. Some high-level consequences of this restriction are:

Now more concretely: how does this shape the inner product? It is completely determined by how it acts on a basis, so let’s check these properties for the indicators. First, the norm of \mathbb{I}_j can only depend on the probability p_j. Also, since they are supported on different outcomes, this definition forces the \mathbb{I}_j to form an orthogonal basis!

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} f(p_i), & i = j \\ 0,& \text{otherwise}. \end{cases}

Where f \colon \mathbb{R}\to \mathbb{R} is a yet undetermined function. To totally define this inner product, we must enforce another property: we require deterministic objects to be unaware of probabilistic structure. Since the only way for a random variable to be deterministic is being constant, this translates to: if C is a constant random variable, then its norm \left\lVert C\right\rVert_2 doesn’t depend on any p_i. Moreover, for the sake of consistency, let’s also require the norm of a constant random variable to be precisely the value that it always returns. In math symbols: If for all \omega, C(\omega) = c then \left\lVert C\right\rVert_2 = c. Now let’s use this new property to determine the inner product values.

\begin{aligned} \left\lVert C\right\rVert_2 &= \sqrt{\left\langle C,C\right\rangle} \\ &= \sqrt{\left\langle c \textstyle\sum_i \mathbb{I}_i, c \textstyle\sum_j \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_{i,j}\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_j f(p_j)} \end{aligned}

Because the p_j are the coefficients of a probability distribution, they must sum to one. Thus, a good choice for f that satisfies our desired properties is to set equal to the identity function. This way we get

\left\lVert C\right\rVert_2 = c \sqrt{\textstyle\sum_j f(p_j)} = c \sqrt{\textstyle\sum_j p_j} = c \cdot 1 = c

Now we finally have an inner product that coherently represents the probabilistic structure of our random variables! On the indicators basis, it is defined as

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} p_i, & i = j \\ 0,& \text{otherwise}. \end{cases}

While for general X and Y, we can use linearity to discover that this amounts to

\left\langle X, Y\right\rangle = \sum_{j} p_j x_j y_j.

What about the mean?

Recall that what an inner product essentially defines are notions of distances and angles between vectors. Particularly, the distance is given by \mathrm{dist}({X},{ Y}) = \sqrt{\left\lVert X-Y\right\rVert_2}. Through it, we finally have the necessary tools to rigorously describe the average using a variational principle, as we talked about on the introduction.

Let X be a random variable. We call its average \mathbb{E}[X] the value of the constant random variable that is closest to it.

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \mathrm{dist}({X},{ C}) \\ \textrm{s.t.} & C(\omega) = c, \forall \omega \end{array}

Let’s distill this definition a bit. For us “closest” means “vector that minimizing the distance”. Remembering that all constant random variables lie on the same line, we get from Geometry/Linear Algebra that \mathbb{E}[X] is the result of the orthogonal projection of X on \sum_j \mathbb{I}_j. (which is a unitary vector) What we really want to calculate can be summarized as a picture:

In order to represent this as a closed formula, we must remember that squaring a non-negative function does not change the point that minimizes it (because it is convex and monotone). Our previous definition can then be more concretely written as this least squares problem:

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \frac{1}{2}\left\lVert X-C\right\rVert_2^2 \\ \textrm{s.t.} & C = c \sum_j \mathbb{I}_j. \end{array}

Least Squares notoriously have a closed form solution but let’s derive it on this particular case. First we turn it into an unconstrained problem by substituting C in the objective function; which makes it look like \frac{1}{2}\sum_j p_i (x_i - c)^2. Since this is convex, all we need to find the minimizer is applying the good old method of differentiating with respect to c and equating to zero.

\begin{aligned} 0 &= \frac{d}{dc} \left(\frac{1}{2}\textstyle\sum_i p_i (x_i - c)^2\right) \\ &= \frac{1}{2}\textstyle\sum_i p_i \cdot 2 (x_i - c) \\ &= \textstyle\sum_i p_i x_i - \textstyle\sum_i p_i c \\ &= \left(\textstyle\sum_i p_i x_i\right) - c \end{aligned}

By moving the scalar c to the equation’s left-hand side, we finally found a formula for the expected value. And, of course, it meets our expectation!3

\boxed{\mathbb{E}[X] = \sum_{i=1}^N p_i x_i}

Let’s gather more old friends

We just defined the average of X through the closest constant to it. A question that naturally arises is how well this represents X. We can answer this by looking to the projection’s error, that is, the distance between X and \mathbb{E}[X]. (In what follows, we sometimes overload the notation \mathbb{E}[X] to also mean the constant random variable with norm \mathbb{E}[X]. What we are talking about should be clear from context)

\begin{aligned} \text{error} &= X - \mathbb{E}[X], \\ \sigma[X] &= \left\lVert\text{error}\right\rVert_2 = \mathrm{dist}({X},{ \mathbb{E}[X]}) \\ &= \sqrt{\sum_{i=1}^N p_i (x_i - \mathbb{E}[X])^2}. \end{aligned}

You probably already know this formula by the name of standard deviation! Although in probability textbooks it appears as the concise but opaque formula \sigma[X] = \sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2]}, here it naturally appears as the size of the least squares’ error from approximating X by a constant.

In real world probability, one of the most widely used kinds of random variables are certainly the Gaussian Random Variables. Among their fundamental properties is that their distribution is uniquely defined by its mean and standard deviation4. This part is only my opinion and no mathematics, but I like to think that what makes them so special is that they can be recovered solely by knowing their approximation by a constant and how far off this approximation is.

Until now, we have only been looking for distances and orthogonal projections. Nevertheless, the inner product also tells us about the angle between two vectors! From Euclidean Geometry, we know that the angle \theta between two vectors satisfies

\left\langle A,B\right\rangle = \left\lVert A\right\rVert_2\left\lVert B\right\rVert_2 \cos\theta.

An interesting use of this angle is in calculating the correlation between random variables. To find it, we first calculate the errors of approximating X and Y by their means. Then, the correlation is defined exactly as the cosine of the angle between these errors. That is, if we let \theta equal the angle between X - \mathbb{E}[X] and Y - \mathbb{E}[Y], then

\begin{aligned} \mathrm{corr}[X, Y] &= \cos(\theta) \\ &= \frac{\left\langle X - \mathbb{E}[X], Y - \mathbb{E}[Y]\right\rangle}{\sigma[X]\sigma[Y]} \end{aligned}

What does the correlation mean, you may ask? Well, in usual statistical parlance, the farther the correlation is from zero, the closer the random variables are from being linear functions from one another, i.e. Y = aX + b.

Do we really need those squares?

Up until this point, we only considered distances the “classical” or Euclidean sense. But sometimes it is not the most appropriate way to measure how far things are. For example, if you are driving in a city, there is no way to cross diagonally between blocks. The best way to describe the distance between two points is through the size of the horizontal and vertical increments one has to traverse from one point to another. Thus let’s talk a bit about these more “exotic” distances.

Coming from our example above, we define the L^1-norm5 of X analogously to the Euclidean norm as

\left\lVert X\right\rVert_1 = \sum_{i = 1}^N p_i |x_i|.

This is just like the Euclidean norm we were talking about but exchanging the squares by absolute values. Instead of measuring the length of the line segment between points, it measures the length we must traverse if we could only walk through a grid parallel to the indicator functions \mathbb{I}_j. This distance is widely used on the literature for its robustness and sparsity-inducing properties. Also, while the Euclidean distance is rotation invariant, the L^1-norm clearly has some preference towards the outcomes’ indicators. So, our variational principle using this distance is

\begin{array}{rl} \min\limits_{c} & \sum_{i=1}^N p_i |x_i - c| \end{array}

This optimization problem can be rewritten as a linear program as has a (possibly non-unique) solution that is also used all around Statistics: the median. This is a constant \mu with the nice property \mathrm{P}(X \le \mu) \ge \frac{1}{2} and \mathrm{P}(X \ge \mu) \ge \frac{1}{2}.

For any measure of distance between random variables between points, there is some associated constant that minimizes it and, in general, is already an important object in Statistics. As an example, try to discover which is the closest constant to X by using the \infty-norm: \left\lVert X-Y\right\rVert_\infty = \max_\omega P(\omega)|X(\omega) - Y(\omega)|. In fact, we don’t even have to use a norm! Any meaningful notion of distance can to be used to project (non-linearly) on the space of constants.

As a last example before we finish, let’s consider a distance that does not come from a norm. Consider \mathrm{dist}({X},{ Y}) = \mathrm{P}(X \ne Y). This is an interesting distance and widely used in signal processing because when Y = 0, it measures the sparsity of X as a vector. Can you guess what constant minimizes it? Try to prove that it is the mode6 of X!


When I started thinking of these ideas, I talked about them to Hugo Nobrega and João Paixão. This post only came to life because of how excited they got with it. Even so that they convinced me it could be interesting to write about it.

I also want to thank Daniel Hashimoto and Ivani Ivanova for reading the first version of this post and helping me with their “bug reports”.

  1. I don’t know why, but my best ideas always seem to come while I’m sleeping or in the shower.↩︎

  2. If you know some measure theory, you know that this definition is oversimplified. It is missing the \sigma-algebra, measurability and other complicated stuff. But, well, the focus of this post is on intuition thus I’m ok with sacrificing generality and some rigour for the sake of simplicity. Nevertheless, we can always think that there is an implicit discrete \sigma-algebra everywhere.↩︎

  3. ba dum tss.↩︎

  4. Or variance if you prefer it squared.↩︎

  5. Also known by the much more stylish names of Manhattan or taxicab’s norm.↩︎

  6. The mode is the value returned with the greatest probability.↩︎

Spread the Word