# Shower Thoughts about Averages

29 December 2021

I have a confession to make: I never quite grasped how to think about expected values. Of course, I use them a lot (really a lot) and I know pretty well how useful they are, but I never had a good intuition for them. In fact, I always thought that their importance was purely operational: the average is easy to calculate, has plenty of applications, and there are a lot of powerful theorems related to it.

Nevertheless, this all changed earlier this year. I
was taking a shower^{1}, not
even thinking about math, when I was suddenly struck by a
flash of clarity: *the average minimizes the squared
error*, I said to myself. This seemingly simple
statement appears (normally as an exercise) in
practically any introductory probability book, but I had
never realized how much it implies! From it, we
can construct the average not by a clever formula but
geometrically, using a variational principle.

Let’s look at it from an information point of view:
suppose you are working with a random variable X and you don’t have room to
store all its information. In fact, for the sake of this
example, you can only store *a single number* to
represent it. Which number should you choose? You
probably already know it is \mathbb{E}[X]… But why? Well…
because it is, in fact, the best possible approximation.
This post will be a digression on this theme:
This post will be a digression on this theme:

The closest constant to a random variable is its expected value.

Of course, there are some things lacking in the statement above. How do you measure the distance between random variables? Moreover, constants are numbers and random variables are functions. They have different types! What does it even mean for them to be close together? The usual probabilistic view doesn’t emphasize this, but we can interpret random variables geometrically as points in space, and then measuring their distance becomes as simple as taking the length of the line segment connecting them. Indeed, once we know the random variables, we can even measure this length through nothing more, nothing less than the Pythagorean Theorem!

## The Geometry of Random Variables

Let’s begin by laying the groundwork. Suppose that we’re considering an event with a finite number (let’s say N) of possible outcomes \Omega = \{\omega_1,\ldots,\omega_N\}, each occurring with probability p_i. You can think of it as throwing a die, pulling a slot machine’s lever, or any other situation where uncertain stuff may happen.

In this framework, we can think of a *random
variable* as a function that assigns a real number
to each possible outcome^{2}, X \colon \Omega \to
\mathbb{R}. As an example, a possible random
variable is *how much* cash each outcome of a
slot machine gives you. Now comes the **most
important part**: these are just real-valued
functions, so summing them or scaling them by a number
always yields another random variable:

\begin{aligned} (X + Y)(\omega) &= X(\omega) + Y(\omega) \\ (kX)(\omega) &= k \cdot X(\omega) \end{aligned}
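For readers who like to see things in code, here is a minimal sketch in Python (the outcomes and values are my own toy example) of random variables as plain lists of values, one per outcome, with pointwise sum and scaling:

```python
# A random variable on a finite outcome space Omega = {w_1, w_2, w_3}
# is determined by the value it assigns to each outcome.
X = [2.0, -1.0, 5.0]    # X(w_1) = 2, X(w_2) = -1, X(w_3) = 5
Y = [1.0, 1.0, 0.0]

# Pointwise operations produce new random variables on the same outcomes.
X_plus_Y = [xi + yi for xi, yi in zip(X, Y)]    # (X + Y)(w) = X(w) + Y(w)
twice_X = [2 * xi for xi in X]                  # (kX)(w) = k * X(w)

print(X_plus_Y)   # [3.0, 0.0, 5.0]
print(twice_X)    # [4.0, -2.0, 10.0]
```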

This means that our random variables form a
*Vector Space*! For us, the beauty of this is
that we can use the tools of good ol’ Euclidean Geometry
to understand how they work. And I don’t know about you,
but I always feel more at home when working in the land
of Linear Algebra.

A nice consequence of having only a finite number of
possible outcomes is that the space of random variables
is actually finite-dimensional! The *indicator
functions* of the outcomes form a basis for this
space:

\mathbb{I}_i(\omega_j) = \begin{cases} 1,& i = j \\ 0,& \text{otherwise}. \end{cases}

The coefficients of a random variable in this basis are also pretty straightforward: they are just the values it returns for each outcome. That is, if X \colon \Omega \to \mathbb{R} returns the value x_i when we sample an outcome \omega_i, then it is decomposable as

X = \sum_{j = 1}^N x_j \mathbb{I}_j.

In two dimensions, we can also visualize this as a figure:

Because we are interested in a geometrical definition
for the mean, let’s consider a special example: random
variables that always return the same value no matter
what outcome we sample. These are called *constant
random variables* and, as you can imagine from the
introduction, are of utmost importance for what we
are doing in this post. So, if C \colon \Omega \to
\mathbb{R} always returns c, it can be decomposed
as

C = \sum_{j = 1}^N c \mathbb{I}_j = \underbrace{c}_{\text{constant}} \cdot \underbrace{\left(\sum_{j=1}^N \mathbb{I}_j\right)}_{\text{direction}}.

Or, as a figure:

From this we see that all constants lie on the same line (a one-dimensional subspace), given by the diagonal of all indicators. Remember that we want a method to find the closest constant to a random variable. Intuitively, we can think of this procedure as projecting the random variable onto the line of constants. So, let’s proceed by considering how to project vectors onto subspaces.

### Where are all the probabilities?

You may have noticed that, until now, we never used the probabilities. The random variables represent values attached to some non-deterministic outcome, but we haven’t used any notion of which outcomes are more probable nor how they relate to each other.

Another thing you may have noticed is that the
previous section was all function algebra, without any
real geometry happening. There were no angles, distances
nor anything like that. Very well, this is the time to
fix things up by killing two birds with one stone. Our
solution will use the *probabilities* of the
outcomes to define an *inner product* on the
space of random variables.

Now it’s modeling time! How can we embed the probabilistic structure into an inner product? To have an inner product that somehow reflects the probabilities, we will ask that it satisfies some coherence conditions.

We want the inner product of two random variables X and Y to depend on the probability p_i only if both of them are non-zero for the outcome \omega_i. This restriction expresses that the information about one possible outcome only matters for that outcome. Some high-level consequences of this restriction are:

- Random variables with disjoint supports (that is, every time one of them is non-zero, the other is zero) are **orthogonal**.
- The norm of a random variable only depends on the outcomes for which it is **non-zero**.

Now more concretely: how does this shape the inner
product? It is completely determined by how it acts on a
basis, so let’s check these properties for the
*indicators*. First, the norm of \mathbb{I}_j can only depend
on the probability p_j.
Also, since they are supported on different outcomes,
this definition forces the \mathbb{I}_j to form an
orthogonal basis!

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} f(p_i), & i = j \\ 0,& \text{otherwise}. \end{cases}

where f \colon
\mathbb{R}\to \mathbb{R} is a yet-undetermined
function. To fully define this inner product, we must
enforce another property: we require deterministic
objects to be unaware of the probabilistic structure. Since
the only way for a random variable to be deterministic
is to be constant, this translates to: *if C is a constant random
variable, then its norm \left\lVert C\right\rVert_2
doesn’t depend on any p_i.* Moreover, for the
sake of consistency, let’s also require the norm of a
constant random variable to be precisely the value that
it always returns (at least for non-negative values). In math symbols: if C(\omega) = c for all \omega, then \left\lVert C\right\rVert_2 =
c. Now let’s use this new property to determine
the inner product’s values.

\begin{aligned} \left\lVert C\right\rVert_2 &= \sqrt{\left\langle C,C\right\rangle} \\ &= \sqrt{\left\langle c \textstyle\sum_i \mathbb{I}_i, c \textstyle\sum_j \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_{i,j}\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_j f(p_j)} \end{aligned}

Because the p_j are the coefficients of a probability distribution, they must sum to one. Thus, a good choice for f that satisfies our desired properties is the identity function. This way we get

\left\lVert C\right\rVert_2 = c \sqrt{\textstyle\sum_j f(p_j)} = c \sqrt{\textstyle\sum_j p_j} = c \cdot 1 = c

Now we finally have an inner product that coherently represents the probabilistic structure of our random variables! In the indicator basis, it is defined as

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} p_i, & i = j \\ 0,& \text{otherwise}. \end{cases}

For general X and Y, we can use bilinearity to discover that this amounts to

\left\langle X, Y\right\rangle = \sum_{j} p_j x_j y_j.
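To make this concrete, here is a quick numerical sketch (the outcome values and probabilities are my own toy choices) of the weighted inner product and the properties we demanded of it:

```python
# Probability-weighted inner product on a space with N = 3 outcomes:
#   <X, Y> = sum_j p_j * x_j * y_j
def inner(X, Y, p):
    return sum(pj * xj * yj for pj, xj, yj in zip(p, X, Y))

def norm(X, p):
    return inner(X, X, p) ** 0.5

p = [0.2, 0.3, 0.5]           # probabilities (must sum to one)
C = [4.0, 4.0, 4.0]           # the constant random variable c = 4

# The norm of a constant is the value it returns, independent of p.
print(round(norm(C, p), 9))   # 4.0

# Indicators of different outcomes are orthogonal, and ||I_j||^2 = p_j.
I1, I2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(inner(I1, I2, p))       # 0.0
print(inner(I1, I1, p))       # 0.2
```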

## What about the mean?

Recall that what an inner product essentially defines
are notions of distances and angles between vectors.
In particular, the distance is given by \mathrm{dist}({X},{Y}) =
\left\lVert X-Y\right\rVert_2 = \sqrt{\left\langle X-Y, X-Y\right\rangle}. Through it,
we finally have the necessary tools to rigorously
describe the **average** using a
variational principle, as we discussed in the
introduction.

Let X be a random
variable. We call its *average* \mathbb{E}[X] the value of
the constant random variable that is closest to it.

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \mathrm{dist}({X},{ C}) \\ \textrm{s.t.} & C(\omega) = c, \forall \omega \end{array}

Let’s distill this definition a bit. For us, “closest”
means “the vector minimizing the distance”. Remembering
that all constant random variables lie on the same line,
we get from Geometry/Linear Algebra that \mathbb{E}[X] is the result
of the *orthogonal projection* of X onto \sum_j \mathbb{I}_j
(which, thanks to our inner product, is a unit vector). What we really want to calculate
can be summarized as a picture:

In order to represent this as a closed formula, we
must remember that squaring a non-negative function does
not change the point that minimizes it (because squaring
is increasing on non-negative values). Our previous definition can then
be more concretely written as this *least squares
problem*:

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \frac{1}{2}\left\lVert X-C\right\rVert_2^2 \\ \textrm{s.t.} & C = c \sum_j \mathbb{I}_j. \end{array}

Least squares notoriously has a closed-form solution,
but let’s derive it in this particular case. First we
turn it into an unconstrained problem by substituting
C in the objective
function, which makes it look like \frac{1}{2}\sum_i p_i (x_i -
c)^2. Since this is convex, all we need to find
the minimizer is to apply the good old method of
differentiating **with respect to c** and
equating to zero.

\begin{aligned} 0 &= \frac{d}{dc} \left(\frac{1}{2}\textstyle\sum_i p_i (x_i - c)^2\right) \\ &= \frac{1}{2}\textstyle\sum_i p_i \cdot 2 (x_i - c) \cdot (-1) \\ &= \textstyle\sum_i p_i c - \textstyle\sum_i p_i x_i \\ &= c - \textstyle\sum_i p_i x_i \end{aligned}

By moving the sum to the equation’s left-hand side, we finally find a formula for the expected value. And, of course, it meets our expectation!^{3}

\boxed{\mathbb{E}[X] = \sum_{i=1}^N p_i x_i}
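We can sanity-check this variational characterization numerically: minimizing the squared distance over a grid of candidate constants recovers the probability-weighted average. (A rough sketch with toy numbers of my own; a brute-force grid stands in for a real solver.)

```python
# Check that the closest constant to X (in squared distance) is E[X].
p = [0.2, 0.3, 0.5]
x = [2.0, -1.0, 5.0]

mean = sum(pi * xi for pi, xi in zip(p, x))   # closed-form E[X]

def sq_dist_to_constant(c):
    # (1/2)||X - C||^2 with C the constant c; the 1/2 doesn't change the argmin
    return 0.5 * sum(pi * (xi - c) ** 2 for pi, xi in zip(p, x))

# Brute-force search over a fine grid of candidate constants.
candidates = [i / 1000 for i in range(-10000, 10001)]
best = min(candidates, key=sq_dist_to_constant)

print(round(mean, 9))    # 2.6
print(round(best, 9))    # 2.6 (up to grid resolution)
```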

### Let’s gather more old friends

We just defined the average of X through the closest
constant to it. A question that naturally arises is how
well this constant represents X.
We can answer it by looking at the projection’s
*error*, that is, the distance between X and \mathbb{E}[X]. (In what
follows, we sometimes overload the notation \mathbb{E}[X] to also mean
the constant random variable whose value is \mathbb{E}[X]. Which one we
mean should be clear from context.)

\begin{aligned} \text{error} &= X - \mathbb{E}[X], \\ \sigma[X] &= \left\lVert\text{error}\right\rVert_2 = \mathrm{dist}({X},{ \mathbb{E}[X]}) \\ &= \sqrt{\sum_{i=1}^N p_i (x_i - \mathbb{E}[X])^2}. \end{aligned}
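Since \mathbb{E}[X] is an orthogonal projection, the error X - \mathbb{E}[X] is perpendicular to the line of constants, and the Pythagorean theorem gives \left\lVert X\right\rVert_2^2 = \mathbb{E}[X]^2 + \sigma[X]^2, which is the familiar identity \mathbb{E}[X^2] = \mathbb{E}[X]^2 + \mathrm{Var}[X]. A quick numerical check (toy numbers of my own choosing):

```python
# Check that the error X - E[X] is orthogonal to the constant direction,
# and verify the Pythagorean identity ||X||^2 = E[X]^2 + sigma[X]^2.
p = [0.2, 0.3, 0.5]
x = [2.0, -1.0, 5.0]

def inner(u, v):
    return sum(pi * ui * vi for pi, ui, vi in zip(p, u, v))

mean = sum(pi * xi for pi, xi in zip(p, x))    # E[X]
error = [xi - mean for xi in x]                # X - E[X]
ones = [1.0, 1.0, 1.0]                         # the direction sum_j I_j

print(abs(inner(error, ones)) < 1e-9)          # True: error is orthogonal
sigma_sq = inner(error, error)                 # sigma[X]^2
print(abs(inner(x, x) - (mean**2 + sigma_sq)) < 1e-9)   # True: Pythagoras
```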

You probably already know this formula by the name of
**standard deviation**! Although in
probability textbooks it appears as the concise but
opaque formula \sigma[X] =
\sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2]}, here it
arises naturally as the size of the least-squares
error from approximating X by a constant.

In real-world probability, one of the most widely
used kinds of random variables is certainly the
*Gaussian random variable*. Among their
fundamental properties is that their distribution is
uniquely defined by its mean and standard deviation^{4}. This part is only
my opinion, not mathematics, but I like to think that
what makes them so special is that they can be recovered
solely by knowing their best constant approximation and
how far off this approximation is.

Until now, we have only been looking at distances and orthogonal projections. Nevertheless, the inner product also tells us about the angle between two vectors! From Euclidean Geometry, we know that the angle \theta between two vectors satisfies

\left\langle A,B\right\rangle = \left\lVert A\right\rVert_2\left\lVert B\right\rVert_2 \cos\theta.

An interesting use of this angle is in calculating
the **correlation** between random
variables. To find it, we first calculate the errors of
approximating X and
Y by their means. Then,
the correlation is defined exactly as the cosine of the
*angle* between these errors. That is, if we let
\theta equal the angle
between X -
\mathbb{E}[X] and Y -
\mathbb{E}[Y], then

\begin{aligned} \mathrm{corr}[X, Y] &= \cos(\theta) \\ &= \frac{\left\langle X - \mathbb{E}[X], Y - \mathbb{E}[Y]\right\rangle}{\sigma[X]\sigma[Y]} \end{aligned}
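This definition is easy to check numerically. The sketch below (toy numbers of my own) computes the correlation as the cosine of the angle between the two errors, using the probability-weighted inner product from before:

```python
# Correlation as the cosine of the angle between the errors X - E[X]
# and Y - E[Y], under the probability-weighted inner product.
p = [0.2, 0.3, 0.5]
x = [2.0, -1.0, 5.0]
y = [3 * xi + 1 for xi in x]          # Y = 3X + 1: a linear function of X

def inner(u, v):
    return sum(pi * ui * vi for pi, ui, vi in zip(p, u, v))

def mean(u):
    return sum(pi * ui for pi, ui in zip(p, u))

def corr(u, v):
    eu = [ui - mean(u) for ui in u]   # error of approximating u by E[u]
    ev = [vi - mean(v) for vi in v]
    return inner(eu, ev) / (inner(eu, eu) ** 0.5 * inner(ev, ev) ** 0.5)

print(round(corr(x, y), 6))                     # 1.0: angle zero
print(round(corr(x, [-xi for xi in x]), 6))     # -1.0: angle pi
```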

What does the correlation mean, you may ask? Well, in usual statistical parlance, the farther the correlation is from zero, the closer the random variables are to being linear functions of one another, i.e., Y = aX + b.

## Do we really need those squares?

Up until this point, we only considered distances in the “classical”, or Euclidean, sense. But sometimes that is not the most appropriate way to measure how far apart things are. For example, if you are driving in a city, there is no way to cut diagonally between blocks: the best way to describe the distance between two points is through the total horizontal and vertical increments one has to traverse to get from one point to the other. So let’s talk a bit about these more “exotic” distances.

Motivated by the example above, we define the L^1-norm^{5} of X analogously to the
Euclidean norm as

\left\lVert X\right\rVert_1 = \sum_{i = 1}^N p_i |x_i|.

This is just like the Euclidean norm we were talking about, but with the squares exchanged for absolute values. Instead of measuring the length of the line segment between points, it measures the length we must traverse if we can only walk on a grid parallel to the indicator functions \mathbb{I}_j. This distance is widely used in the literature for its robustness and sparsity-inducing properties. Also, while the Euclidean distance is rotation-invariant, the L^1-norm clearly has some preference towards the outcomes’ indicators. So, our variational principle using this distance is

\begin{array}{rl} \min\limits_{c} & \sum_{i=1}^N p_i |x_i - c| \end{array}

This optimization problem can be rewritten as a
*linear program* and has a (possibly non-unique)
solution that is also used all around Statistics: the
**median**. This is a constant \mu with the nice property
that \mathrm{P}(X \le \mu) \ge
\frac{1}{2} and \mathrm{P}(X \ge \mu) \ge
\frac{1}{2}.
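As with the mean, we can check numerically (toy numbers of my own) that a grid search on the weighted L^1 objective lands on the median rather than the mean:

```python
# The constant minimizing sum_i p_i |x_i - c| is a median of X.
p = [0.2, 0.55, 0.25]
x = [2.0, -1.0, 5.0]    # median is -1 here, since P(X = -1) = 0.55 > 1/2

def l1_dist_to_constant(c):
    return sum(pi * abs(xi - c) for pi, xi in zip(p, x))

# Brute-force search over a grid of candidate constants.
candidates = [i / 100 for i in range(-1000, 1001)]
best = min(candidates, key=l1_dist_to_constant)

mean = sum(pi * xi for pi, xi in zip(p, x))
print(best)              # -1.0: the median wins...
print(round(mean, 9))    # 1.1: ...and differs from the mean
```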

For any measure of distance between random variables between points, there is some associated constant that minimizes it and, in general, is already an important object in Statistics. As an example, try to discover which is the closest constant to X by using the \infty-norm: \left\lVert X-Y\right\rVert_\infty = \max_\omega P(\omega)|X(\omega) - Y(\omega)|. In fact, we don’t even have to use a norm! Any meaningful notion of distance can to be used to project (non-linearly) on the space of constants.

As a last example before we finish, let’s consider a
distance that does not come from a norm. Consider \mathrm{dist}({X},{ Y}) =
\mathrm{P}(X \ne Y). This is an interesting
distance and widely used in signal processing because
when Y = 0, it measures
the sparsity of X as a
vector. Can you guess what constant minimizes it? Try to
prove that it is the *mode*^{6} of X!
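We can at least verify this claim numerically (again with toy numbers of my own; this is a sanity check, not the proof the exercise asks for):

```python
# The constant minimizing dist(X, C) = P(X != c) is a mode of X.
p = [0.2, 0.55, 0.25]
x = [2.0, -1.0, 5.0]

def prob_not_equal(c):
    return sum(pi for pi, xi in zip(p, x) if xi != c)

# Any c outside the attained values gives P(X != c) = 1, so it suffices
# to search among the values X actually takes.
best = min(x, key=prob_not_equal)
mode = x[p.index(max(p))]    # the value with the greatest probability

print(best, mode)   # -1.0 -1.0
```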

## Acknowledgements

When I started thinking about these ideas, I talked about them with Hugo Nobrega and João Paixão. This post only came to life because of how excited they got about it. So much so that they convinced me it could be interesting to write about it.

I also want to thank Daniel Hashimoto and Ivani Ivanova for reading the first version of this post and helping me with their “bug reports”.

I don’t know why, but my best ideas always seem to come while I’m sleeping or in the shower.↩︎

If you know some measure theory, you know that this definition is oversimplified. It is missing the \sigma-algebra, measurability and other complicated stuff. But, well, the focus of this post is on intuition thus I’m ok with sacrificing generality and some rigour for the sake of simplicity. Nevertheless, we can always think that there is an implicit discrete \sigma-algebra everywhere.↩︎

ba dum tss.↩︎

Or variance if you prefer it squared.↩︎

Also known by the much more stylish names of Manhattan or taxicab’s norm.↩︎

The *mode* is the value returned with the greatest probability.↩︎