29 December 2021

I have a confession to make: I never quite grasped how to think about the expected value. Of course, I use it a lot (really, really a lot) and I know pretty well how useful it is, but I never had a good intuition for it. In fact, I always thought that its importance was purely operational: the average is easy to calculate, has plenty of applications, and there are a lot of powerful theorems related to it.

Nevertheless, this all changed earlier this year. I was taking a
shower^{1}, not even thinking about math, when I
was suddenly struck by a flash of clarity: *the average minimizes the
squared error*, I said to myself. This seemingly simple statement appears
(usually as an exercise) in practically any introductory
probability book, but I had never realized how much it implies! From
this, we can construct the average not from a clever formula but
geometrically, via a variational principle.

Let’s look at it from an information point of view: suppose you are
working with a random variable X and
you don’t have room to store all of its information. In fact, for the sake
of this example, you can only store *a single number* to
represent it. Which number should you choose? You probably already know
it is \mathbb{E}[X]… But why? Well…
because it is, in fact, the best possible approximation. This post will be
a digression on this theme:

The closest constant to a random variable is its expected value.

Of course, some things are lacking in the theorem above. How do you measure the distance between random variables? Moreover, constants are numbers and random variables are functions. They have different types! What does it even mean for them to be close together? The usual probabilistic view doesn’t emphasize this, but we can interpret random variables geometrically as points in space, and then measuring their distance becomes as simple as taking the length of the line segment connecting them. Indeed, if we know the random variables, we can even measure this length through nothing more, nothing less than the Pythagorean Theorem!

Let’s begin by laying the groundwork. Suppose that we’re considering an event with a finite amount (let’s say N) of possible outcomes \Omega = \{\omega_1,\ldots,\omega_N\}, each occurring with a probability p_i. You can think of it as throwing a die, pulling a slot machine’s lever or any other situation where uncertain stuff may happen.

In this framework, we can think of a *random variable* as a
function that assigns a real number to each possible outcome^{2}, X \colon
\Omega \to \mathbb{R}. As an example, a possible random variable
is *how much* cash each outcome of a slot machine gives you. Now
comes the **most important part**: these are just
real-valued functions, so summing them or scaling one by a number always
yields another random variable:

\begin{aligned} (X + Y)(\omega) &= X(\omega) + Y(\omega) \\ (kX)(\omega) &= k \cdot X(\omega) \end{aligned}

This means that our random variables form a *Vector Space*!
For us, the beauty of this is that we can use the tools of good ol’
Euclidean Geometry to understand how they work. And I don’t know about
you, but I always feel more at home when working in the land of Linear
Algebra.
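To make this vector-space view concrete, here is a minimal sketch in Python. A random variable on a finite Ω is just its list of values, one per outcome; the slot-machine payouts below are made-up numbers.

```python
# A random variable on a finite outcome space is just its list of values,
# one per outcome: X = [X(ω_1), ..., X(ω_N)].  Sums and scalar multiples
# are computed pointwise, exactly as for vectors.

def rv_add(X, Y):
    """(X + Y)(ω) = X(ω) + Y(ω)."""
    return [x + y for x, y in zip(X, Y)]

def rv_scale(k, X):
    """(kX)(ω) = k · X(ω)."""
    return [k * x for x in X]

# Made-up payouts of two slot machines over three outcomes:
X = [0.0, 5.0, 20.0]
Y = [1.0, 1.0, 1.0]

print(rv_add(X, Y))    # [1.0, 6.0, 21.0]
print(rv_scale(2, X))  # [0.0, 10.0, 40.0]
```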

A nice consequence of having only a finite number of possible
outcomes is that the space of random variables is actually
finite-dimensional! The *indicator functions* of the outcomes
form a basis for this space:

\mathbb{I}_i(\omega_j) = \begin{cases} 1,& i = j \\ 0,& \text{otherwise}. \end{cases}

The coefficients of a random variable in this basis are also pretty straightforward. They are just the values it returns for each outcome. That is, if X \colon \Omega \to \mathbb{R} returns the value x_j when we sample the outcome \omega_j, then it is decomposable as

X = \sum_{j = 1}^N x_j \mathbb{I}_j.
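Sticking with the list-of-values representation, we can check this decomposition numerically (the values of X are again made-up):

```python
def indicator(j, N):
    """I_j as a vector: 1 at outcome j, 0 elsewhere."""
    return [1 if i == j else 0 for i in range(N)]

x = [0.0, 5.0, 20.0]   # made-up coefficients x_j
N = len(x)

# Rebuild X from its coordinates in the indicator basis: X = Σ_j x_j I_j.
X = [0.0] * N
for j in range(N):
    X = [a + x[j] * b for a, b in zip(X, indicator(j, N))]

print(X)   # [0.0, 5.0, 20.0]
```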

In two dimensions, we can also visualize this as a figure:

Because we are interested in a geometrical definition for the mean,
let’s consider a special example: random variables that always return
the same value no matter what outcome we sample. These are called
*constant random variables* and, as you can imagine from the
introduction, they are of utmost importance for what we are doing in this
post. So, if C \colon \Omega \to
\mathbb{R} always returns c, it
can be decomposed as

C = \sum_{j = 1}^N c \mathbb{I}_j = \underbrace{c}_{\text{constant}} \cdot \underbrace{\left(\sum_{j=1}^N \mathbb{I}_j\right)}_{\text{direction}}.

Or, as a figure:

From this we see that all constants lie on the same line (a one-dimensional subspace), spanned by the diagonal of all indicators. Remember that we want a method to find the closest constant to a random variable. Intuitively, we can think of this procedure as projecting the random variable onto the line of constants. So, let’s proceed by considering how to project vectors onto subspaces.

Possibly you noticed that until now, we never used the probabilities. The random variables represent values attached to some non-deterministic outcome; however, we haven’t used any notion of which outcomes are more probable, nor of how they relate to each other.

Another thing you may have noticed is that the previous section was
all function algebra, without any real geometry happening. There were no
angles, distances nor anything like that. Very well, this is the time to
fix things up by killing two birds with one stone. Our solution will use
the *probabilities* of the outcomes to define an *inner
product* on the space of random variables.

Now it’s modeling time! How can we embed the probabilistic structure into an inner product? To have an inner product that somehow reflects the probabilities, we will ask that it satisfies some coherence conditions.

We want the inner product of two random variables X and Y to depend on the probability p_i only if both of them are non-zero for the outcome \omega_i. This restriction expresses that the information about one possible outcome only matters for that outcome. Some high-level consequences of this restriction are:

- Random variables with disjoint supports (that is, whenever one of them is non-zero, the other is zero) are **orthogonal**.
- The norm of a random variable only depends on the outcomes for which it is **non-zero**.

Now more concretely: how does this shape the inner product? It is
completely determined by how it acts on a basis, so let’s check these
properties for the *indicators*. First, the norm of \mathbb{I}_j can only depend on the
probability p_j. Also, since they are
supported on different outcomes, this definition forces the \mathbb{I}_j to form an orthogonal basis!

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} f(p_i), & i = j \\ 0,& \text{otherwise}. \end{cases}

where f \colon \mathbb{R}\to
\mathbb{R} is a yet undetermined function. To totally define this
inner product, we must enforce another property: we require
deterministic objects to be unaware of the probabilistic structure. Since
the only way for a random variable to be deterministic is to be
constant, this translates to: *if C
is a constant random variable, then its norm \left\lVert C\right\rVert_2 doesn’t depend on
any p_i.* Moreover, for the sake of
consistency, let’s also require the norm of a constant random variable
to be precisely the value it always returns (taking c \ge 0, since
norms are non-negative). In math symbols: if
C(\omega) = c for all \omega, then \left\lVert C\right\rVert_2 = c. Now let’s
use this new property to determine the inner product’s values.

\begin{aligned} \left\lVert C\right\rVert_2 &= \sqrt{\left\langle C,C\right\rangle} \\ &= \sqrt{\left\langle c \textstyle\sum_i \mathbb{I}_i, c \textstyle\sum_j \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_{i,j}\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle} \\ &= c \sqrt{\textstyle\sum_j f(p_j)} \end{aligned}

Because the p_j are the coefficients of a probability distribution, they must sum to one. Thus, a good choice for f that satisfies our desired properties is to set it equal to the identity function. This way we get

\left\lVert C\right\rVert_2 = c \sqrt{\textstyle\sum_j f(p_j)} = c \sqrt{\textstyle\sum_j p_j} = c \cdot 1 = c

Now we finally have an inner product that coherently represents the probabilistic structure of our random variables! On the indicators basis, it is defined as

\left\langle\mathbb{I}_i, \mathbb{I}_j\right\rangle = \begin{cases} p_i, & i = j \\ 0,& \text{otherwise}. \end{cases}

While for general X and Y, we can use bilinearity to discover that this amounts to

\left\langle X, Y\right\rangle = \sum_{j} p_j x_j y_j.
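This inner product is one line of code. The sketch below (with made-up probabilities) double-checks the two properties we demanded: a constant’s norm is the value it returns, and indicators of distinct outcomes are orthogonal.

```python
import math

def inner(X, Y, p):
    """⟨X, Y⟩ = Σ_j p_j x_j y_j."""
    return sum(pj * xj * yj for pj, xj, yj in zip(p, X, Y))

def norm(X, p):
    """‖X‖₂ = √⟨X, X⟩."""
    return math.sqrt(inner(X, X, p))

p = [0.5, 0.25, 0.25]            # made-up probabilities, summing to one
C = [3.0, 3.0, 3.0]              # the constant random variable C(ω) = 3
I1 = [1, 0, 0]; I2 = [0, 1, 0]   # indicators of two distinct outcomes

print(norm(C, p))        # 3.0 — a constant's norm is the value it returns
print(inner(I1, I2, p))  # 0.0 — disjoint supports give orthogonality
```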

Recall that what an inner product essentially defines are notions of
distance and angle between vectors. In particular, the distance is
given by \mathrm{dist}({X},{Y}) = \left\lVert X-Y\right\rVert_2 =
\sqrt{\left\langle X-Y, X-Y\right\rangle}. Through it, we finally have
the necessary tools to rigorously describe the **average**
using a variational principle, as we talked about in the
introduction.

Let X be a random variable. We call
its *average* \mathbb{E}[X] the
value of the constant random variable that is closest to it.

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \mathrm{dist}({X},{ C}) \\ \textrm{s.t.} & C(\omega) = c, \forall \omega \end{array}

Let’s distill this definition a bit. For us, “closest” means “the vector
that minimizes the distance”. Remembering that all constant random
variables lie on the same line, we get from Geometry/Linear Algebra that
\mathbb{E}[X] is the result of the
*orthogonal projection* of X onto
\sum_j \mathbb{I}_j (which is a
unit vector). What we really want to calculate can be summarized as a
picture:

In order to represent this as a closed formula, we must remember that
squaring a non-negative function does not change the point that
minimizes it (because squaring is monotone on non-negative values). Our
previous definition can then be more concretely written as this *least
squares problem*:

\begin{array}{rl} \mathbb{E}[X] = \argmin\limits_{c} & \frac{1}{2}\left\lVert X-C\right\rVert_2^2 \\ \textrm{s.t.} & C = c \sum_j \mathbb{I}_j. \end{array}

Least squares problems notoriously have a closed-form solution, but let’s
derive it in this particular case. First we turn it into an
unconstrained problem by substituting C
in the objective function, which makes it look like \frac{1}{2}\sum_i p_i (x_i - c)^2. Since this
is convex, all we need to find the minimizer is to apply the good old
method of differentiating **with respect to c** and
equating to zero.

\begin{aligned} 0 &= \frac{d}{dc} \left(\frac{1}{2}\textstyle\sum_i p_i (x_i - c)^2\right) \\ &= -\textstyle\sum_i p_i (x_i - c) \\ &= \textstyle\sum_i p_i c - \textstyle\sum_i p_i x_i \\ &= c - \textstyle\sum_i p_i x_i \end{aligned}

By isolating the scalar c on the
equation’s left-hand side, we finally find a formula for the expected
value. And, of course, it meets our expectations!^{3}

\boxed{\mathbb{E}[X] = \sum_{i=1}^N p_i x_i}
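We can test the variational definition against this closed formula: a brute-force scan over candidate constants should land exactly on \sum_i p_i x_i. The distribution below is made-up.

```python
p = [0.5, 0.25, 0.25]   # made-up probabilities, summing to one
x = [0.0, 5.0, 20.0]    # made-up values X takes on each outcome

def sq_error(c):
    """The objective ½‖X − C‖² for the constant C(ω) = c."""
    return 0.5 * sum(pi * (xi - c) ** 2 for pi, xi in zip(p, x))

mean = sum(pi * xi for pi, xi in zip(p, x))   # the closed formula Σ p_i x_i

# Brute-force the minimizer over a fine grid and compare:
grid = [i / 100 for i in range(0, 2001)]      # candidates c ∈ [0, 20]
best = min(grid, key=sq_error)
print(mean, best)   # 6.25 6.25
```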

We just defined the average of X
through the closest constant to it. A question that naturally arises is
how well this constant represents X. We can
answer it by looking at the projection’s *error*, that is, the
distance between X and \mathbb{E}[X]. (In what follows, we sometimes
overload the notation \mathbb{E}[X] to
also mean the constant random variable with norm \mathbb{E}[X]. Which one we are talking about
should be clear from context.)

\begin{aligned} \text{error} &= X - \mathbb{E}[X], \\ \sigma[X] &= \left\lVert\text{error}\right\rVert_2 = \mathrm{dist}({X},{ \mathbb{E}[X]}) \\ &= \sqrt{\sum_{i=1}^N p_i (x_i - \mathbb{E}[X])^2}. \end{aligned}

You probably already know this formula by the name of
**standard deviation**! Although in probability textbooks
it appears as the concise but opaque formula \sigma[X] = \sqrt{\mathbb{E}[(X -
\mathbb{E}[X])^2]}, here it arises naturally as the size of the
least squares error from approximating X by a constant.
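Continuing with a made-up distribution, the standard deviation is literally the length of the residual vector:

```python
import math

p = [0.5, 0.25, 0.25]   # made-up probabilities
x = [0.0, 5.0, 20.0]    # made-up values of X

mean = sum(pi * xi for pi, xi in zip(p, x))   # E[X] = 6.25

# σ[X] is the norm of the residual X − E[X]:
sigma = math.sqrt(sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, x)))
print(sigma)   # ≈ 8.1968
```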

In real-world probability, one of the most widely used kinds of
random variables is certainly the *Gaussian Random Variable*.
Among their fundamental properties is that their distribution is
uniquely determined by their mean and standard deviation^{4}.
This part is just my opinion, not mathematics, but I like to think
that what makes them so special is that they can be recovered solely by
knowing their best approximation by a constant and how far off this
approximation is.

Until now, we have only been looking at distances and orthogonal projections. Nevertheless, the inner product also tells us about the angle between two vectors! From Euclidean Geometry, we know that the angle \theta between two vectors satisfies

\left\langle A,B\right\rangle = \left\lVert A\right\rVert_2\left\lVert B\right\rVert_2 \cos\theta.

An interesting use of this angle is in calculating the
**correlation** between random variables. To find it, we
first calculate the errors of approximating X and Y by
their means. Then, the correlation is defined exactly as the cosine of
the *angle* between these errors. That is, if we let \theta equal the angle between X - \mathbb{E}[X] and Y - \mathbb{E}[Y], then

\begin{aligned} \mathrm{corr}[X, Y] &= \cos(\theta) \\ &= \frac{\left\langle X - \mathbb{E}[X], Y - \mathbb{E}[Y]\right\rangle}{\sigma[X]\sigma[Y]} \end{aligned}

What does the correlation mean, you may ask? Well, in the usual statistical parlance, the farther the correlation is from zero, the closer the random variables are to being linear functions of one another, i.e., Y = aX + b.
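Here is a small sketch of this formula (again with made-up data): a random variable and an affine transformation of it come out perfectly correlated.

```python
import math

def corr(X, Y, p):
    """corr[X, Y] = ⟨X − E[X], Y − E[Y]⟩ / (σ[X] σ[Y])."""
    ex = sum(pi * xi for pi, xi in zip(p, X))
    ey = sum(pi * yi for pi, yi in zip(p, Y))
    rx = [xi - ex for xi in X]   # residual X − E[X]
    ry = [yi - ey for yi in Y]   # residual Y − E[Y]
    dot = sum(pi * a * b for pi, a, b in zip(p, rx, ry))
    sx = math.sqrt(sum(pi * a * a for pi, a in zip(p, rx)))
    sy = math.sqrt(sum(pi * b * b for pi, b in zip(p, ry)))
    return dot / (sx * sy)

p = [0.5, 0.25, 0.25]
X = [0.0, 5.0, 20.0]
Y = [2 * xi + 1 for xi in X]   # an affine function of X
print(corr(X, Y, p))           # ≈ 1.0, up to floating-point rounding
```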

Up until this point, we have only considered distances in the “classical”, Euclidean sense. But sometimes that is not the most appropriate way to measure how far apart things are. For example, if you are driving in a city, there is no way to cut diagonally between blocks: the best description of the distance between two points is the total length of the horizontal and vertical segments one has to traverse to go from one to the other. So let’s talk a bit about these more “exotic” distances.

Coming from our example above, we define the L^1-norm^{5} of X analogously to the Euclidean norm as

\left\lVert X\right\rVert_1 = \sum_{i = 1}^N p_i |x_i|.

This is just like the Euclidean norm we were talking about, but with the squares exchanged for absolute values. Instead of measuring the length of the line segment between points, it measures the length we must traverse if we can only walk through a grid parallel to the indicator functions \mathbb{I}_j. This distance is widely used in the literature for its robustness and sparsity-inducing properties. Also, while the Euclidean distance is rotation invariant, the L^1-norm clearly has some preference towards the outcomes’ indicators. So, our variational principle using this distance is

\begin{array}{rl} \min\limits_{c} & \sum_{i=1}^N p_i |x_i - c| \end{array}

This optimization problem can be rewritten as a *linear
program* and has a (possibly non-unique) solution that is also used
all around Statistics: the **median**. This is a constant
\mu with the nice property that \mathrm{P}(X \le \mu) \ge \frac{1}{2} and
\mathrm{P}(X \ge \mu) \ge
\frac{1}{2}.
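As with the mean, we can check this by brute force on a made-up distribution: scanning candidate constants for the weighted absolute error lands on the median.

```python
p = [0.25, 0.5, 0.25]   # made-up probabilities
x = [0.0, 5.0, 20.0]    # made-up values of X

def abs_error(c):
    """The L¹ distance from X to the constant c: Σ_i p_i |x_i − c|."""
    return sum(pi * abs(xi - c) for pi, xi in zip(p, x))

# Brute-force scan over a fine grid of candidate constants.
grid = [i / 100 for i in range(0, 2001)]   # candidates c ∈ [0, 20]
best = min(grid, key=abs_error)
print(best)   # 5.0 — the median: P(X ≤ 5) = 0.75 and P(X ≥ 5) = 0.75
```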

For any measure of distance between random variables, there is some associated constant that minimizes it and that, in general, is already an important object in Statistics. As an example, try to discover the closest constant to X under the \infty-norm: \left\lVert X-Y\right\rVert_\infty = \max_\omega P(\omega)|X(\omega) - Y(\omega)|. In fact, we don’t even have to use a norm! Any meaningful notion of distance can be used to project (non-linearly) onto the space of constants.

As a last example before we finish, let’s consider a distance that
does not come from a norm: \mathrm{dist}({X},{ Y}) = \mathrm{P}(X \ne
Y). This is an interesting distance, widely used in signal
processing, because when Y = 0 it
measures the sparsity of X as a vector.
Can you guess what constant minimizes it? Try to prove that it is the
*mode*^{6} of X!
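And one last brute-force check, on the same kind of made-up distribution: minimizing \mathrm{P}(X \ne c) indeed picks out the most probable value.

```python
p = [0.25, 0.5, 0.25]   # made-up probabilities
x = [0.0, 5.0, 20.0]    # made-up values of X

def mismatch(c):
    """dist(X, C) = P(X ≠ c): the probability mass where X differs from c."""
    return sum(pi for pi, xi in zip(p, x) if xi != c)

# Any c outside the values X actually takes has distance 1,
# so only those values are worth trying.
best = min(x, key=mismatch)
print(best)   # 5.0 — the mode, i.e. the most probable value of X
```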

When I started thinking about these ideas, I talked about them with Hugo Nobrega and João Paixão. This post only came to life because of how excited they got about it. So much so that they convinced me it could be interesting to write about.

I also want to thank Daniel Hashimoto and Ivani Ivanova for reading the first version of this post and helping me with their “bug reports”.