# A geometric interpretation of the covariance matrix

## Introduction

In this article, we provide an intuitive, geometric interpretation of the covariance matrix, by exploring the relation between linear transformations and the resulting data covariance. Most textbooks explain the shape of data based on the concept of covariance matrices. Instead, we take a backwards approach and explain the concept of covariance matrices based on the shape of data.

In a previous article, we discussed the concept of variance, and provided a derivation and proof of the well-known formula to estimate the sample variance. Figure 1 was used in that article to show that the standard deviation, as the square root of the variance, provides a measure of how much the data is spread across the feature space.

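We showed that an unbiased estimate of the variance can be obtained from $N$ samples $x_i$ with sample mean $\mu$ by:

$$\sigma_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2 \tag{1}$$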
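As a minimal sketch of equation (1), the estimate can be computed by hand and compared with NumPy's `np.var` using `ddof=1`. The helper name `sample_variance` and the example values are assumptions chosen for illustration only:

```python
import numpy as np

def sample_variance(x):
    """Unbiased variance estimate: sum of squared deviations divided by N - 1."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = x.mean()
    return np.sum((x - mu) ** 2) / (n - 1)

# Illustrative values (not taken from the article's figures)
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_manual = sample_variance(x)
var_numpy = np.var(x, ddof=1)   # ddof=1 selects the N - 1 denominator
std_manual = np.sqrt(var_manual)

print(var_manual, var_numpy, std_manual)
```

Both variance values should agree; with the default `ddof=0`, `np.var` would divide by $N$ instead.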

However, variance can only be used to explain the spread of the data in the directions parallel to the axes of the feature space. Consider the 2D feature space shown by figure 2:

For this data, we could calculate the variance in the x-direction and the variance in the y-direction. However, the horizontal spread and the vertical spread of the data do not explain the clear diagonal correlation. Figure 2 clearly shows that on average, if the x-value of a data point increases, the y-value also increases, resulting in a positive correlation. This correlation can be captured by extending the notion of variance to what is called the ‘covariance’ of the data:

$$\sigma(x, y) = E\big[(x - E(x))\,(y - E(y))\big] \tag{2}$$

For 2D data, we thus obtain $\sigma(x,x)$, $\sigma(y,y)$, $\sigma(x,y)$ and $\sigma(y,x)$. These four values can be summarized in a matrix, called the covariance matrix:

$$\Sigma = \begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{bmatrix} \tag{3}$$

If x is positively correlated with y, y is also positively correlated with x. In other words, we can state that $\sigma(x,y) = \sigma(y,x)$. Therefore, the covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal. Two-dimensional normally distributed data is explained completely by its mean and its $2 \times 2$ covariance matrix. Similarly, a $3 \times 3$ covariance matrix is used to capture the spread of three-dimensional data, and an $N \times N$ covariance matrix captures the spread of N-dimensional data.
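The construction of the covariance matrix can be sketched in a few lines of NumPy. The synthetic data, the seed, and the helper name `covariance` are assumptions made for this example; the expectation in the covariance definition is replaced by the usual $N-1$ sample estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic, positively correlated 2D data (illustrative only)
x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)

def covariance(a, b):
    """Sample covariance of two 1D arrays, using the N - 1 denominator."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.sum(a_centered * b_centered) / (a.size - 1)

# Assemble the 2x2 covariance matrix entry by entry
sigma = np.array([[covariance(x, x), covariance(x, y)],
                  [covariance(y, x), covariance(y, y)]])

# np.cov treats each row as one variable and also divides by N - 1
sigma_numpy = np.cov(np.vstack([x, y]))

print(sigma)
```

The diagonal holds the two variances, and the two off-diagonal entries are identical, which is the symmetry described above.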

Figure 3 illustrates how the overall shape of the data defines the covariance matrix:

Now let’s forget about covariance matrices for a moment. Each of the examples in figure 3 can simply be considered to be a linearly transformed instance of figure 4:

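Let the data shown by figure 4 be $D$, then each of the examples shown by figure 3 can be obtained by linearly transforming $D$:

$$D' = T\,D \tag{4}$$

where $T$ is a transformation matrix consisting of a rotation matrix $R$ and a scaling matrix $S$:

$$T = R\,S \tag{5}$$

These matrices are defined as:

$$R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \tag{6}$$

where $\theta$ is the rotation angle, and:

$$S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \tag{7}$$

where $s_x$ and $s_y$ are the scaling factors in the x direction and the y direction respectively.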
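A short sketch of equations (5) to (7) in NumPy follows. The function names `rotation_matrix` and `scaling_matrix`, the angle, and the scale factors are assumptions picked for illustration:

```python
import numpy as np

def rotation_matrix(theta):
    """2D counter-clockwise rotation by theta radians, as in equation (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def scaling_matrix(sx, sy):
    """Axis-aligned scaling, as in equation (7)."""
    return np.diag([sx, sy])

theta = np.pi / 6              # 30 degrees, illustrative
R = rotation_matrix(theta)
S = scaling_matrix(4.0, 1.0)
T = R @ S                      # scale first, then rotate (equation (5))

# A point on the x-axis is stretched to (4, 0) and then rotated by 30 degrees
point = np.array([1.0, 0.0])
transformed = T @ point

print(transformed)
```

Because matrix products act from right to left, $T = RS$ applies the scaling before the rotation.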

In the following section, we will discuss the relation between the covariance matrix $\Sigma$ and the linear transformation matrix $T = R\,S$.

## Covariance matrix as a linear transformation

Let’s start with unscaled (scale equals 1) and unrotated data. In statistics this is often referred to as ‘white data’ because its samples are drawn from a standard normal distribution and therefore correspond to white (uncorrelated) noise:

The covariance matrix of this ‘white’ data equals the identity matrix, such that the variances and standard deviations equal 1 and the covariance equals zero:

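$$\Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{8}$$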

Now let’s scale the data in the x-direction with a factor 4:

$$D' = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} D \tag{9}$$

The data now looks as follows:

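The covariance matrix $\Sigma'$ of $D'$ is now:

$$\Sigma' = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 16 & 0 \\ 0 & 1 \end{bmatrix} \tag{10}$$

Thus, the covariance matrix $\Sigma'$ of the resulting data $D'$ is related to the linear transformation $T$ that is applied to the original data as follows: $D' = T\,D$, where

$$T = \sqrt{\Sigma'} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} \tag{11}$$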
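The scaling experiment can be reproduced numerically. This is a sketch under stated assumptions: the seed, the sample count, and the layout (each column of `D` is one sample) are choices made here, and the estimated covariance only approximates the exact values because it comes from a finite sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# White data: 2 x N matrix, each column is one sample from N(0, I)
D = rng.standard_normal((2, 100_000))

# Scale the x-direction by a factor 4
T = np.array([[4.0, 0.0],
              [0.0, 1.0]])
D_prime = T @ D

cov_white = np.cov(D)          # close to the identity matrix
cov_scaled = np.cov(D_prime)   # close to [[16, 0], [0, 1]]

# For this axis-aligned case the transformation is recovered as the
# square root of the variances on the diagonal (tiny off-diagonals ignored)
T_est = np.diag(np.sqrt(np.diag(cov_scaled)))

print(np.round(cov_scaled, 2))
print(np.round(T_est, 2))
```

A scale factor of 4 shows up as a variance of roughly 16, which is why the transformation corresponds to the square root of the covariance matrix here.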

However, although equation (11) holds when the data is scaled in the x and y direction, the question arises whether it also holds when a rotation is applied. To investigate the relation between the linear transformation matrix $T$ and the covariance matrix $\Sigma'$ in the general case, we will therefore try to decompose the covariance matrix into the product of rotation and scaling matrices.

In an earlier article we saw that a linear transformation matrix $T$ is completely defined by its eigenvectors and eigenvalues. Applied to the covariance matrix, this means that:

$$\Sigma\,\vec{v} = \lambda\,\vec{v} \tag{12}$$

where $\vec{v}$ is an eigenvector of $\Sigma$, and $\lambda$ is the corresponding eigenvalue.

Since the eigenvalues are scalars, equation (12) says that applying $\Sigma$ to an eigenvector $\vec{v}$ can only scale $\vec{v}$; it cannot change its direction. A first important conclusion is that each eigenvalue of the covariance matrix equals the variance of the data along its corresponding eigenvector. In particular, the eigenvector with the largest eigenvalue always points in the direction of the largest variance of the data. This observation forms the basis of Principal Component Analysis and is illustrated by figure 7.

The largest eigenvector of the covariance matrix (that is, the eigenvector with the largest eigenvalue), shown in green, points in the direction of the largest variance of the original data. The second eigenvector, shown in magenta, is always orthogonal to the first. The eigenvalues represent the size of the arrows and thus correspond to the magnitude of the spread (the variance) in these directions. The covariance matrix represents the horizontal and vertical spread of the data by its (diagonal) variance components, and the correlation between x and y that results from the rotation by its (off-diagonal) covariance components. If the data had not been rotated, the eigenvectors would be axis-aligned, the covariance would be zero, and the variances would equal the eigenvalues.
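The claim about eigenvectors and variance can be checked on synthetic data. The rotation angle, scale factors, and seed below are assumptions for illustration; `np.linalg.eigh` is used because the covariance matrix is symmetric, and it returns eigenvalues in ascending order:

```python
import numpy as np

rng = np.random.default_rng(1)

# White data (2 x N), then scaled by (4, 1) and rotated by 30 degrees
D = rng.standard_normal((2, 100_000))
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([4.0, 1.0])
D_prime = R @ S @ D

sigma = np.cov(D_prime)

# Columns of eigvecs are unit-length eigenvectors; last column = largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(sigma)
v_max = eigvecs[:, -1]
lambda_max = eigvals[-1]

# Variance of the data after projecting every sample onto v_max
var_along_v_max = np.var(v_max @ D_prime, ddof=1)

# |cos| of the angle between v_max and the rotated x-axis (first column of R)
alignment = abs(v_max @ R[:, 0])

print(lambda_max, var_along_v_max, alignment)
```

The projected variance matches the largest eigenvalue, and the corresponding eigenvector lines up with the direction in which the data was stretched. The sign of an eigenvector is arbitrary, hence the absolute value.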

A second important conclusion that can be drawn from equation (12) is that the covariance matrix can be seen as a linear transformation matrix that maps each of its eigenvectors onto a scaled version of itself, where the scale factor is the corresponding eigenvalue. In the following paragraphs, we will show why these two conclusions are true, and how we can relate arbitrary linear transformations on our original data to the covariance matrix of the resulting data.

Equation (12) holds for each eigenvector-eigenvalue pair of matrix $\Sigma$. In the 2D case, we obtain two eigenvectors and two eigenvalues. The system of two equations defined by equation (12) can be represented efficiently using matrix notation:

$$\Sigma\,V = V\,L \tag{13}$$

where $V$ is the matrix whose columns are the eigenvectors of $\Sigma$ and $L$ is the diagonal matrix whose non-zero elements are the corresponding eigenvalues.

This means that we can represent the covariance matrix as a function of its eigenvectors and eigenvalues:

$$\Sigma = V\,L\,V^{-1} \tag{14}$$
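Equations (13) and (14) can be verified on a small example. The matrix below is an arbitrary symmetric example chosen for illustration, not one from the figures:

```python
import numpy as np

# Small symmetric example matrix (illustrative); its eigenvalues are 1 and 3
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# eigh is intended for symmetric matrices and returns orthonormal eigenvectors
eigvals, V = np.linalg.eigh(sigma)
L = np.diag(eigvals)

lhs = sigma @ V                          # left-hand side of equation (13)
rhs = V @ L                              # right-hand side of equation (13)
reconstructed = V @ L @ np.linalg.inv(V) # equation (14)

print(eigvals)
print(reconstructed)
```

Since the eigenvectors returned for a symmetric matrix are orthonormal, `np.linalg.inv(V)` could be replaced by `V.T` here.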

Equation (14) is called the eigendecomposition of the covariance matrix. Because $\Sigma$ is symmetric and positive semi-definite, it can also be obtained using a Singular Value Decomposition algorithm. Whereas the eigenvectors represent the directions of the largest variance of the data, the eigenvalues represent the magnitude of this variance in those directions. In other words, $V$ represents a rotation matrix, while $\sqrt{L}$ represents a scaling matrix. The covariance matrix can thus be decomposed further as:

$$\Sigma = R\,S\,S\,R^{-1} \tag{15}$$

where $R = V$ is a rotation matrix and $S = \sqrt{L}$ is a scaling matrix.
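A sketch of the decomposition into a rotation and a scaling is shown below, again on an arbitrary example matrix. One caveat that is an observation of this sketch rather than something stated in the article: eigenvectors are only defined up to sign, so the matrix returned by `np.linalg.eigh` may be a reflection (determinant $-1$), and flipping one column turns it into a proper rotation without changing the product:

```python
import numpy as np

# Illustrative symmetric, positive-definite matrix with eigenvalues 2 and 8
sigma = np.array([[5.0, 3.0],
                  [3.0, 5.0]])

eigvals, V = np.linalg.eigh(sigma)

# Flip one eigenvector if V is a reflection, so that det(R) = +1
if np.linalg.det(V) < 0:
    V[:, 0] = -V[:, 0]

R = V                              # rotation matrix
S = np.diag(np.sqrt(eigvals))      # scaling matrix: square roots of eigenvalues

sigma_from_RS = R @ S @ S @ np.linalg.inv(R)

# For a symmetric positive semi-definite matrix, the singular values
# equal the eigenvalues (returned in descending instead of ascending order)
U, singular_values, Vt = np.linalg.svd(sigma)

print(np.diag(S))
print(singular_values)
```

The scaling factors are the square roots of the eigenvalues, that is, the standard deviations along the eigenvector directions.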

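In equation (5) we defined a linear transformation $T = R\,S$. Since $S$ is a diagonal scaling matrix, $S = S^T$. Furthermore, since $R$ is an orthogonal matrix, $R^{-1} = R^T$. Therefore, $T^T = (R\,S)^T = S^T R^T = S\,R^{-1}$. The covariance matrix can thus be written as:

$$\Sigma = R\,S\,S\,R^{-1} = T\,T^T \tag{16}$$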
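As a brief sketch of why $T\,T^T$ is indeed the covariance of the transformed data, assume that $D$ is zero-mean white data whose columns are the $N$ samples, so that its sample covariance is (approximately) the identity matrix $I$. The symbol $N$ and the column layout are assumptions made for this derivation:

$$
\begin{aligned}
\Sigma' &= \frac{1}{N-1}\, D' D'^{T} = \frac{1}{N-1}\, (T D)(T D)^{T} \\
        &= T \left( \frac{1}{N-1}\, D D^{T} \right) T^{T} = T\, I\, T^{T} = T\, T^{T}
\end{aligned}
$$

The transformation $T$ can be pulled out of the sum because it is the same for every sample.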

In other words, if we apply the linear transformation defined by $T = R\,S$ to the original white data $D$ shown by figure 5, we obtain the rotated and scaled data $D' = T\,D$ with covariance matrix $\Sigma' = T\,T^T = R\,S\,S\,R^{-1}$. This is illustrated by figure 8:

The colored arrows in figure 8 represent the eigenvectors. The largest eigenvector, i.e. the eigenvector with the largest corresponding eigenvalue, always points in the direction of the largest variance of the data and thereby defines its orientation. Subsequent eigenvectors are always orthogonal to the largest eigenvector: the covariance matrix is symmetric, so its eigenvectors form the orthonormal columns of the rotation matrix $R$.
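The whole chain can be sketched end to end: transform white data with a known $T = RS$, estimate the covariance matrix, and recover the scaling factors and the orientation from its eigendecomposition. The angle, scales, seed, and sample count are assumptions for illustration, and the recovered values are estimates from a finite sample:

```python
import numpy as np

rng = np.random.default_rng(7)

# White data: each column is one sample
D = rng.standard_normal((2, 200_000))

# Known transformation (illustrative values)
theta = np.deg2rad(40.0)
sx, sy = 3.0, 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([sx, sy])
T = R @ S

sigma_theory = T @ T.T               # covariance predicted by T T^T
sigma_empirical = np.cov(T @ D)      # covariance estimated from the data

# Recover scales and orientation from the estimated covariance
eigvals, eigvecs = np.linalg.eigh(sigma_empirical)
scales = np.sqrt(eigvals[::-1])      # largest first: approximately (sx, sy)
v_max = eigvecs[:, -1]
# Orientation of the principal axis, modulo 180 degrees (sign is arbitrary)
angle = np.degrees(np.arctan2(v_max[1], v_max[0])) % 180.0

print(np.round(sigma_theory, 3))
print(np.round(sigma_empirical, 3))
print(np.round(scales, 3), round(angle, 2))
```

The orientation is only recoverable modulo 180 degrees, and the assignment of scales to axes relies on the two eigenvalues being different.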

## Conclusion

In this article we showed that the covariance matrix of observed data is directly related to a linear transformation of white, uncorrelated data. This linear transformation is completely defined by the eigenvectors and eigenvalues of the covariance matrix. While the eigenvectors represent the rotation matrix, the eigenvalues correspond to the square of the scaling factor in each dimension.

Great article thank you

The covariance matrix is symmetric. Hence we can find a basis of orthonormal eigenvectors and then $\Sigma=VL V^T$.

From computational point of view it is much simpler to find $V^T$ than $V^{-1}$.

Very true, Alex, and thanks for your comment! This is also written in the article: “Furthermore, since R is an orthogonal matrix, R^{-1} = R^T”. But you are right that I only mention this near the end of the article, mostly because it is easier to develop an intuitive understanding of the first part of the article by considering R^{-1} instead of R^T.

Great post! I had a couple questions:

1) The data D doesn’t need to be Gaussian does it?

2) Is [9] reversed (should D be on the left)?

Hi Brian:

1) Indeed the data D does not need to be Gaussian for the theory to hold, I should probably have made that more clear in the article. However, talking about covariance matrices often does not have much meaning in highly non-Gaussian data.

2) That depends on whether D is a row vector or a column vector I suppose. In this case, if each column of D is a data entry, then R*D = (D^t*R^t)^t

Thank you for this great post! But let me please correct one fundamental mistake that you made. The square root of covariance matrix M is not equal to R * S. The square root of M equals R * S * R’, where R’ is transposed R. Proof: (R * S * R’) * (R * S * R’) = R * S * R’ * R * S * R’ = R * S * S * R’ = T * T’ = M. And, of course, T is not a symmetric matrix (in your post T = T’, which is wrong).

Thanks a lot for noticing! You are right indeed, I will get back about this soon (don’t really have time right now).

Edit: I just fixed this mistake. Sorry for the long delay, I didn’t find the time before. Thanks a lot for your feedback!

Very useful article. What I feel needs to be included is the interpretation of the action of the covariance matrix as a linear operator. For example, the eigenvectors of the covariance matrix form the principal components in PCA. So, basically, the covariance matrix takes an input data point (vector) and, if it resembles the data points from which the operator was obtained, keeps it invariant (up to scaling). Is there a better way to interpret the eigenvectors of the covariance matrix?