Computer vision for dummies https://www.visiondummy.com A blog about intelligent algorithms, machine learning, computer vision, datamining and more. Tue, 04 May 2021 14:17:31 +0000 en-US hourly 1 https://wordpress.org/?v=3.8.39 Hybrid deep learning for modeling driving behavior from sensor data https://www.visiondummy.com/2017/09/hybrid-deep-learning/ https://www.visiondummy.com/2017/09/hybrid-deep-learning/#comments Mon, 25 Sep 2017 17:38:57 +0000 http://www.visiondummy.com/?p=1013 Usage based insurance solutions where smartphone sensor data is used to analyze the driver’s behavior are becoming prevalent these days. However, a major shortcoming of most solutions on the market today, is the fact that trips where the user was a passenger, e.g. taxi trips, get included in the user’s risk profile. At Sentiance, we [...]

The post Hybrid deep learning for modeling driving behavior from sensor data appeared first on Computer vision for dummies.

Learning Driver DNA

Usage based insurance solutions where smartphone sensor data is used to analyze the driver’s behavior are becoming prevalent these days. However, a major shortcoming of most solutions on the market today, is the fact that trips where the user was a passenger, e.g. taxi trips, get included in the user’s risk profile.

Accelerometer and gyroscope data can be used to estimate a user’s driving behavior.

At Sentiance, we developed a deep learning based solution that gradually learns to model the user’s driving behavior, and that can be used to detect and remove those trips where the user was actually a passenger.

The input to our deep neural network is raw accelerometer and gyroscope data. A convolutional neural network is used to learn low-level features that describe the user specific driving characteristics, whereas an LSTM layer is stacked on top of the convolutional block to model long-term temporal dependencies within the sensor data.

The goal of this project was to let the neural network learn a feature space in which trips with a similar driving behavior appear close to each other, while the cosine distance between trips with distinct kinds of driving behavior should be large. Since it is difficult and expensive to gather large amounts of labelled data in this case, we applied a clever transfer learning trick: the network was initially trained with thousands of userIDs as output labels, and was forced to learn to classify which trip originated from which user. Once convergence was reached, we chopped off the top soft-max layer of the network, and used the dense layer directly as our feature space.

Transfer learning was used by training the network to perform a related task, and then using one of the dense layers directly as our feature space for the real task at hand.

Basically, this allowed the network to learn a metric space that we can then use for clustering or outlier detection, even on trips for new users that never appeared in our training data.
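To make the idea of a metric space concrete, here is a minimal sketch of how trip embeddings could be compared with the cosine distance. The 4-dimensional vectors are hypothetical values made up for illustration; they are not outputs of the Sentiance network, and the centroid-based comparison is only one assumed way of using such a space.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of three known trips of one driver (illustrative only).
driver_trips = np.array([[0.9, 0.1, 0.0, 0.2],
                         [1.0, 0.2, 0.1, 0.1],
                         [0.8, 0.0, 0.1, 0.3]])
new_trip_same_driver = np.array([0.95, 0.1, 0.05, 0.2])
new_trip_passenger = np.array([0.1, 0.9, 0.7, 0.0])

# Compare each new trip against the centroid of the driver's known trips.
centroid = driver_trips.mean(axis=0)
d_same = cosine_distance(new_trip_same_driver, centroid)
d_other = cosine_distance(new_trip_passenger, centroid)
print(f"same driver: {d_same:.3f}, passenger trip: {d_other:.3f}")
```

A trip whose distance to the driver's centroid is unusually large would be a candidate passenger trip; where to put that threshold is a separate, empirical question.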

Check out our latest blog post at Sentiance that describes the model and approach in detail: Applying deep learning to distinguish drivers from passengers using sensor data

Deep learning for long-term predictions https://www.visiondummy.com/2017/04/deep-learning-for-long-term-prediction/ https://www.visiondummy.com/2017/04/deep-learning-for-long-term-prediction/#comments Wed, 26 Apr 2017 10:40:17 +0000 http://www.visiondummy.com/?p=945 At Sentiance, we use machine learning to extract intelligence from smartphone sensor data such as accelerometer, gyroscope and location. We’ve been doing this for quite a while now, and are very proud on our state-of-the-art results regarding sensor based activity detection, map matching, driving behavior, venue mapping and more. The obvious next step is to [...]

The post Deep learning for long-term predictions appeared first on Computer vision for dummies.

Detecting versus predicting

At Sentiance, we use machine learning to extract intelligence from smartphone sensor data such as accelerometer, gyroscope and location. We’ve been doing this for quite a while now, and are very proud of our state-of-the-art results regarding sensor based activity detection, map matching, driving behavior, venue mapping and more.

The obvious next step is to go from simply detecting what you are doing, to predicting what you will be doing in the future. Knowing your near-term future allows us to explain the intent of your current situation. For example, if we detect you are currently running, and we can predict that you will be on a train followed by a visit to your work location, then we can immediately explain why you are running; you’re obviously not just being sportive today!

Deep learning to the rescue

To be able to come up with long-term predictions, we started out with a simple Markov-chain-like approach and ended up turning to deep learning. We trained a Long Short-Term Memory (LSTM) recurrent neural network on several thousand event timelines. The network learns to encode general human behavior and, surprisingly, is able to quickly adapt to specific user habits. The following figure illustrates the architecture of our deep learning pipeline:

LSTM architecture used for event prediction. The input consists of a sequence of your last 128 events, while the output is a prediction of your next event, together with a duration estimate of the current event. The network is implemented using TensorFlow.
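For contrast with the LSTM, the kind of simple Markov-chain baseline mentioned above can be sketched in a few lines. This is an illustrative first-order model with made-up event names and timelines; it is not the model that was used at Sentiance.

```python
from collections import Counter, defaultdict

def fit_markov_chain(timelines):
    """Count first-order transitions (current event -> next event)."""
    counts = defaultdict(Counter)
    for events in timelines:
        for current, nxt in zip(events, events[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, current):
    """Return the most frequent successor of `current`, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]

# Hypothetical event timelines (illustrative only).
timelines = [
    ["home", "running", "train", "work"],
    ["home", "car", "work", "car", "home"],
    ["home", "running", "train", "work", "train", "home"],
]
model = fit_markov_chain(timelines)
next_after_running = predict_next(model, "running")
print(next_after_running)
```

Such a model only looks one event back, which is exactly the limitation that a recurrent network with a 128-event input window is meant to overcome.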

I’m extremely proud of the results we achieve, mainly because of the following:

  1. The network is often able to predict events that a human observer would not even think of. This sometimes feels like magic, even for the geekiest of our data scientists – and we all like magic!
  2. We started out with simpler Bayesian models (Markov based) and gradually moved to more advanced solutions
  3. Deep learning in this case actually solves a real problem instead of just following the hype
  4. Our deep learning pipeline actually runs in production for millions of users. We put a lot of effort into making it scalable, reproducible and maintainable.

Check out our technical blog post that outlines the details on how we trained and tested our models, and how they actually work: http://www.sentiance.com/2017/04/25/predictive-analytics-applying-deep-learning-on-mobile-sensor-data/

Check out the videos of our prediction results (Links to the original blog post)

Feature extraction using PCA https://www.visiondummy.com/2014/05/feature-extraction-using-pca/ https://www.visiondummy.com/2014/05/feature-extraction-using-pca/#comments Fri, 16 May 2014 09:33:27 +0000 http://www.visiondummy.com/?p=328 In this article, we discuss how Principal Component Analysis (PCA) works, and how it can be used as a dimensionality reduction technique for classification problems. At the end of this article, Matlab source code is provided for demonstration purposes. In an earlier article, we discussed the so called Curse of Dimensionality and showed that classifiers [...]

The post Feature extraction using PCA appeared first on Computer vision for dummies.

Introduction

In this article, we discuss how Principal Component Analysis (PCA) works, and how it can be used as a dimensionality reduction technique for classification problems. At the end of this article, Matlab source code is provided for demonstration purposes.

In an earlier article, we discussed the so called Curse of Dimensionality and showed that classifiers tend to overfit the training data in high dimensional spaces. The question then arises which features should be preferred and which ones should be removed from a high dimensional feature vector.

If all features in this feature vector were statistically independent, one could simply eliminate the least discriminative features from this vector. The least discriminative features can be found by various greedy feature selection approaches. However, in practice, many features depend on each other or on an underlying unknown variable. A single feature could therefore represent a combination of multiple types of information by a single value. Removing such a feature would remove more information than needed. In the next paragraphs, we introduce PCA as a feature extraction solution to this problem, and introduce its inner workings from two different perspectives.

PCA as a decorrelation method

More often than not, features are correlated. As an example, consider the case where we want to use the red, green and blue components of each pixel in an image to classify the image (e.g. detect dogs versus cats). Image sensors that are most sensitive to red light also capture some blue and green light. Similarly, sensors that are most sensitive to blue and green light also exhibit a certain degree of sensitivity to red light. As a result, the R, G, B components of a pixel are statistically correlated. Therefore, simply eliminating the R component from the feature vector, also implicitly removes information about the G and B channels. In other words, before eliminating features, we would like to transform the complete feature space such that the underlying uncorrelated components are obtained.

Consider the following example of a 2D feature space:

Figure 1. 2D correlated data with eigenvectors shown in color.

The features x and y, illustrated by figure 1, are clearly correlated. In fact, their covariance matrix is:

    \begin{equation*} \Sigma = \begin{bmatrix} 16.87 & 14.94 \\[0.3em] 14.94 & 17.27 \\[0.3em] \end{bmatrix} \end{equation*}

In an earlier article we discussed the geometric interpretation of the covariance matrix. We saw that the covariance matrix can be decomposed as a sequence of rotation and scaling operations on white, uncorrelated data, where the rotation matrix is defined by the eigenvectors of this covariance matrix. Therefore, intuitively, it is easy to see that the data D shown in figure 1 can be decorrelated by rotating each data point such that the eigenvectors V become the new reference axes:

(1)   \begin{equation*} D' = V^{\intercal} \, D \end{equation*}

Figure 2. 2D uncorrelated data with eigenvectors shown in color.

The covariance matrix of the resulting data is now diagonal, meaning that the new axes are uncorrelated:

    \begin{equation*} \Sigma' = \begin{bmatrix} 2.13 & 0.0 \\[0.3em] 0.0 & 32.01 \\[0.3em] \end{bmatrix} \end{equation*}

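In fact, the original data used in this example and shown by figure 1 was generated by linearly combining two independent 1D Gaussian feature vectors x_1 \sim N(0, 1) and x_2 \sim N(0, 16) (mean and variance; the covariance of roughly 15 between x and y shown above requires x_2 to have a variance of about 16) as follows: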

    \begin{align*} x &= x_2 + x_1\\ y &= x_2 - x_1 \end{align*}

Since the features x and y are linear combinations of some unknown underlying components x_1 and x_2, directly eliminating either x or y as a feature would have removed some information from both x_1 and x_2. Instead, rotating the data by the eigenvectors of its covariance matrix, allowed us to directly recover the independent components x_1 and x_2 (up to a scaling factor). This can be seen as follows: The eigenvectors of the covariance matrix of the original data are (each column represents an eigenvector):

    \begin{equation*} V = \begin{bmatrix} -0.7071 & 0.7071 \\[0.3em] 0.7071 & 0.7071 \\[0.3em] \end{bmatrix} \end{equation*}

The first thing to notice is that V in this case is an orthogonal matrix that corresponds to a rotation of 45 degrees (cos(45°)=0.7071), up to the sign of one axis, which indeed is evident from figure 1. Secondly, treating V as a linear transformation matrix results in a new coordinate system, such that each new feature x' and y' is expressed as a linear combination of the original features x and y:

(2)   \begin{align*} x' &= -0.7071 \, x + 0.7071 y \\ &= -0.7071 \, (x_2 + x_1) + 0.7071 \, (x_2 - x_1) \\ &= -1.4142 \, x_1 \end{align*}

and

(3)   \begin{align*} y' &= 0.7071 \, x + 0.7071 \, y \\ &= 0.7071 \, (x_2 + x_1) + 0.7071 \, (x_2 - x_1) \\ &= 1.4142 \, x_2 \end{align*}

In other words, decorrelation of the feature space corresponds to the recovery of the unknown, uncorrelated components x_1 and x_2 of the data (up to an unknown scaling factor if the transformation matrix was not orthogonal). Once these components have been recovered, it is easy to reduce the dimensionality of the feature space by simply eliminating either x_1 or x_2.
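The recovery of the hidden components can be checked numerically. The following is a minimal NumPy sketch, assuming x_1 has standard deviation 1 and x_2 has standard deviation 4 (an assumption chosen to mimic the elongated cloud of figure 1); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(0.0, 1.0, n)   # hidden component, std 1
x2 = rng.normal(0.0, 4.0, n)   # hidden component, std 4 (assumed)

# Observed, correlated features; each column of D is one observation.
D = np.vstack([x2 + x1, x2 - x1])
D = D - D.mean(axis=1, keepdims=True)

cov = np.cov(D)
eigvals, V = np.linalg.eigh(cov)   # ascending eigenvalues, eigenvectors as columns

# Express the data in the eigenvector basis.
D_rot = V.T @ D
cov_rot = np.cov(D_rot)            # diagonal: the new axes are uncorrelated

# Each rotated axis matches one hidden component up to sign and scale.
corr_small = abs(np.corrcoef(D_rot[0], x1)[0, 1])
corr_large = abs(np.corrcoef(D_rot[1], x2)[0, 1])
print(np.round(cov_rot, 3), corr_small, corr_large)
```

The diagonal of the rotated covariance matrix equals the eigenvalues of the original covariance matrix, which is why sorting eigenvectors by eigenvalue is the same as sorting the new axes by variance.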

In the above example we started with a two-dimensional problem. If we would like to reduce the dimensionality, the question remains whether to eliminate x_1 (and thus x') or x_2 (and thus y'). Although this choice could depend on many factors such as the separability of the data in case of classification problems, PCA simply assumes that the most interesting feature is the one with the largest variance or spread. This assumption is based on an information theoretic point of view: for Gaussian data, the dimension with the largest variance corresponds to the dimension with the largest entropy and thus encodes the most information. The smallest eigenvectors (those with the smallest eigenvalues) will often simply represent noise components, whereas the largest eigenvectors (those with the largest eigenvalues) often correspond to the principal components that define the data.

Dimensionality reduction by means of PCA is then accomplished simply by projecting the data onto the largest eigenvectors of its covariance matrix. For the above example, the resulting 1D feature space is illustrated by figure 3:

Figure 3. PCA: 2D data projected onto its largest eigenvector.

Obviously, the above example easily generalizes to higher dimensional feature spaces. For instance, in the three-dimensional case, we can either project the data onto the plane defined by the two largest eigenvectors to obtain a 2D feature space, or we can project it onto the largest eigenvector to obtain a 1D feature space. This is illustrated by figure 4:

Figure 4. 3D data projected onto a 2D or 1D linear subspace by means of Principal Component Analysis.

In general, PCA allows us to obtain a linear M-dimensional subspace of the original N-dimensional data, where M \le N. Furthermore, if the unknown, uncorrelated components are Gaussian distributed, then PCA actually acts as an independent component analysis since uncorrelated Gaussian variables are statistically independent. However, if the underlying components are not normally distributed, PCA merely generates decorrelated variables which are not necessarily statistically independent. In this case, non-linear dimensionality reduction algorithms might be a better choice.

PCA as an orthogonal regression method

In the above discussion, we started with the goal of obtaining independent components (or at least uncorrelated components if the data is not normally distributed) to reduce the dimensionality of the feature space. We found that these so called ‘principal components’ are obtained by the eigendecomposition of the covariance matrix of our data. The dimensionality is then reduced by projecting the data onto the largest eigenvectors.

Now let’s forget about our wish to find uncorrelated components for a while. Instead, we will now try to reduce the dimensionality by finding a linear subspace of the original feature space onto which we can project our data such that the projection error is minimized. In the 2D case, this means that we try to find a vector such that projecting the data onto this vector corresponds to a projection error that is lower than the projection error that would be obtained when projecting the data onto any other possible vector. The question is then how to find this optimal vector.

Consider the example shown by figure 5. Three different projection vectors are shown, together with the resulting 1D data. In the next paragraphs, we will discuss how to determine which projection vector minimizes the projection error. Before searching for a vector that minimizes the projection error, we have to define this error function.

Figure 5. Dimensionality reduction by projection onto a linear subspace.

A well known method to fit a line to 2D data is least squares regression. Given the independent variable x and the dependent variable y, the least squares regressor corresponds to the line f(x) = ax + b, such that the sum of the squared residual errors \sum_{i=1}^N (f(x_i) - y_i)^2 is minimized. In other words, if x is treated as the independent variable, then the obtained regressor f(x) is a linear function that can predict the dependent variable y such that the squared error is minimal. The resulting model f(x) is illustrated by the blue line in figure 5, and the error that is minimized is illustrated in figure 6.

Figure 6. Linear regression where x is the independent variable and y is the dependent variable, corresponds to minimizing the vertical projection error.

However, in the context of feature extraction, one might wonder why we would define feature x as the independent variable and feature y as the dependent variable. In fact, we could easily define y as the independent variable and find a linear function f(y) that predicts the dependent variable x, such that \sum_{i=1}^N (f(y_i) - x_i)^2 is minimized. This corresponds to minimization of the horizontal projection error and results in a different linear model as shown by figure 7:

Figure 7. Linear regression where y is the independent variable and x is the dependent variable, corresponds to minimizing the horizontal projection error.
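The asymmetry of ordinary least squares is easy to reproduce. The sketch below uses synthetic data of our own choosing (both features are noisy observations of a shared variable t) and fits the line once with x as the independent variable and once with y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
t = rng.normal(0.0, 3.0, n)          # shared underlying variable (synthetic)
x = t + rng.normal(0.0, 1.0, n)      # noisy observation 1
y = t + rng.normal(0.0, 1.0, n)      # noisy observation 2

# Regress y on x: minimizes the vertical error.
slope_y_on_x, intercept_y_on_x = np.polyfit(x, y, 1)

# Regress x on y: minimizes the horizontal error.
slope_x_on_y, intercept_x_on_y = np.polyfit(y, x, 1)

# Rewrite the second line in the same (x, y) plane: y = (x - b) / a.
slope_x_on_y_in_xy = 1.0 / slope_x_on_y
print(slope_y_on_x, slope_x_on_y_in_xy)
```

Even though the data is symmetric in x and y, the first fit is flatter than the diagonal and the second one steeper, because each fit attributes all of the noise to its own dependent variable.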

Clearly, the choice of independent and dependent variables changes the resulting model, making ordinary least squares regression an asymmetric regressor. The reason for this is that least squares regression assumes the independent variable to be noise-free, whereas the dependent variable is assumed to be noisy. However, in the case of classification, all features are usually noisy observations such that neither x nor y should be treated as independent. In fact, we would like to obtain a model f(x,y) that minimizes both the horizontal and the vertical projection error simultaneously. This corresponds to finding a model such that the orthogonal projection error is minimized as shown by figure 8.

Figure 8. Linear regression where both variables are independent corresponds to minimizing the orthogonal projection error.

The resulting regression is called Total Least Squares regression or orthogonal regression, and assumes that both variables are imperfect observations. An interesting observation is now that the obtained vector, representing the projection direction that minimizes the orthogonal projection error, corresponds to the largest principal component of the data:

Figure 9. The vector onto which the data can be projected with minimal orthogonal error corresponds to the largest eigenvector of the covariance matrix of the data.

In other words, if we want to reduce the dimensionality by projecting the original data onto a vector such that the squared projection error is minimized in all directions, we can simply project the data onto the largest eigenvectors. This is exactly what we called Principal Component Analysis in the previous section, where we showed that such projection also decorrelates the feature space.
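This equivalence can be verified by brute force. The sketch below, on synthetic 2D data of our own choosing, sweeps over candidate line directions, measures the orthogonal projection error of each, and compares the best direction with the largest eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
t = rng.normal(0.0, 3.0, n)
points = np.column_stack([t + rng.normal(0.0, 1.0, n),
                          0.5 * t + rng.normal(0.0, 1.0, n)])
points = points - points.mean(axis=0)    # center the data

def orthogonal_error(direction):
    """Sum of squared distances from the points to the line through the origin."""
    u = direction / np.linalg.norm(direction)
    residual = points - np.outer(points @ u, u)
    return float(np.sum(residual ** 2))

# Brute-force search over line directions between 0 and 180 degrees.
angles = np.linspace(0.0, np.pi, 1801)
errors = [orthogonal_error(np.array([np.cos(a), np.sin(a)])) for a in angles]
best_angle = angles[int(np.argmin(errors))]

# Direction of the largest eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
pc1 = eigvecs[:, -1]
pc1_angle = np.arctan2(pc1[1], pc1[0]) % np.pi
print(np.degrees(best_angle), np.degrees(pc1_angle))
```

The two angles agree up to the resolution of the search grid, without any regression having been fitted explicitly.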

A practical PCA application: Eigenfaces

Although the above examples are limited to two or three dimensions for visualization purposes, dimensionality reduction usually becomes important when the number of features is not negligible compared to the number of training samples. As an example, suppose we would like to perform face recognition, i.e. determine the identity of the person depicted in an image, based on a training dataset of labeled face images. One approach might be to treat the brightness of each pixel of the image as a feature. If the input images are of size 32×32 pixels, this means that the feature vector contains 1024 feature values. Classifying a new face image can then be done by calculating the Euclidean distance between this 1024-dimensional vector, and the feature vectors of the people in our training dataset. The smallest distance then tells us which person we are looking at.

However, operating in a 1024-dimensional space becomes problematic if we only have a few hundred training samples. Furthermore, Euclidean distances behave strangely in high dimensional spaces as discussed in an earlier article. Therefore, we could use PCA to reduce the dimensionality of the feature space by calculating the eigenvectors of the covariance matrix of the set of 1024-dimensional feature vectors, and then projecting each feature vector onto the largest eigenvectors.

Since an eigenvector of 2D data is 2-dimensional, and an eigenvector of 3D data is 3-dimensional, the eigenvectors of 1024-dimensional data are 1024-dimensional. In other words, we could reshape each of the 1024-dimensional eigenvectors to a 32×32 image for visualization purposes. Figure 10 shows the first four eigenvectors obtained by eigendecomposition of the Cambridge face dataset:

Figure 10. The four largest eigenvectors, reshaped to images, resulting in so called EigenFaces. (source: https://nl.wikipedia.org/wiki/Eigenface)

Each 1024-dimensional feature vector (and thus each face) can now be projected onto the N largest eigenvectors, and can be represented as a linear combination of these eigenfaces. The weights of these linear combinations determine the identity of the person. Since the largest eigenvectors represent the largest variance in the data, these eigenfaces describe the most informative image regions (eyes, nose, mouth, etc.). By only considering the first N (e.g. N=70) eigenvectors, the dimensionality of the feature space is greatly reduced.
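The mechanics of the eigenface projection can be sketched as follows. Random pixel values stand in for a real face dataset here (an assumption made purely so the example runs on its own), so the resulting "eigenfaces" are meaningless as images; only the shapes and operations are of interest.

```python
import numpy as np

rng = np.random.default_rng(3)
n_images, height, width = 200, 32, 32

# Synthetic stand-in for a face dataset: one flattened 32x32 image per row.
images = rng.random((n_images, height * width))

mean_face = images.mean(axis=0)
centered = images - mean_face

# Rows of Vt are the eigenvectors of the covariance matrix,
# ordered from largest to smallest variance.
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

n_components = 70
eigenfaces = Vt[:n_components]                                   # (70, 1024)
eigenface_images = eigenfaces.reshape(n_components, height, width)

# Each face becomes a 70-dimensional weight vector ...
weights = centered @ eigenfaces.T                                # (200, 70)
# ... from which an approximation of the face can be rebuilt.
reconstructed = weights @ eigenfaces + mean_face
print(weights.shape, eigenface_images.shape)
```

Classification would then compare the 70-dimensional weight vectors instead of the 1024-dimensional pixel vectors.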

The remaining question is now how many eigenfaces should be used, or in the general case, how many eigenvectors should be kept. Removing too many eigenvectors might remove important information from the feature space, whereas eliminating too few eigenvectors leaves us with the curse of dimensionality. Regrettably there is no straight answer to this problem. Although cross-validation techniques can be used to obtain an estimate of this hyperparameter, choosing the optimal number of dimensions remains a problem that is mostly solved in an empirical (an academic term that means not much more than ‘trial-and-error’) manner. Note that it is often useful to check how much (as a percentage) of the variance of the original data is kept while eliminating eigenvectors. This is done by dividing the sum of the kept eigenvalues by the sum of all eigenvalues.

The PCA recipe

Based on the previous sections, we can now list the simple recipe used to apply PCA for feature extraction:

1) Center the data

In an earlier article, we showed that the covariance matrix can be written as a sequence of linear operations (scaling and rotations). The eigendecomposition extracts these transformation matrices: the eigenvectors represent the rotation matrix, while the eigenvalues represent the scaling factors. However, the covariance matrix does not contain any information related to the translation of the data. Indeed, to represent translation, an affine transformation would be needed instead of a linear transformation.

Therefore, before applying PCA to rotate the data in order to obtain uncorrelated axes, any existing shift needs to be countered by subtracting the mean of the data from each data point. This simply corresponds to centering the data such that its average becomes zero.

2) Normalize the data

The eigenvectors of the covariance matrix point in the direction of the largest variance of the data. However, variance is an absolute number, not a relative one. This means that the variance of data, measured in centimeters (or inches) will be much larger than the variance of the same data when measured in meters (or feet). Consider the example where one feature represents the length of an object in meters, while the second feature represents the width of the object in centimeters. The largest variance, and thus the largest eigenvector, will implicitly be defined by the second feature if the data is not normalized, simply because its numerical values are larger.

To avoid this scale-dependent nature of PCA, it is useful to normalize the data by dividing each feature by its standard deviation. This is especially important if different features correspond to different metrics.
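The effect of mixed units can be demonstrated with a small sketch. The object sizes below are synthetic values of our own choosing: a length in meters and a correlated width in centimeters.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
length_m = rng.normal(2.0, 0.5, n)                      # length in meters
width_cm = 60.0 * length_m + rng.normal(0.0, 20.0, n)   # width in centimeters
X = np.column_stack([length_m, width_cm])

def leading_eigenvector(data):
    """Eigenvector of the covariance matrix with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return eigvecs[:, -1]

# Step 1: center. Without normalization, the centimeter feature dominates.
centered = X - X.mean(axis=0)
pc_raw = leading_eigenvector(centered)

# Step 2: divide each feature by its standard deviation.
standardized = centered / centered.std(axis=0, ddof=1)
pc_std = leading_eigenvector(standardized)
print(np.round(pc_raw, 3), np.round(pc_std, 3))
```

Before normalization the first principal component is almost exactly the centimeter axis; after normalization both features contribute equally to it.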

3) Calculate the eigendecomposition

Since the data will be projected onto the largest eigenvectors to reduce the dimensionality, the eigendecomposition needs to be obtained. One of the most widely used methods to efficiently calculate the eigendecomposition is Singular Value Decomposition (SVD).
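The link between the two decompositions can be checked directly: the right singular vectors of the centered data matrix are the eigenvectors of its covariance matrix, and the squared singular values divided by n-1 are the eigenvalues. A minimal sketch on synthetic data (observations stored as rows here, which is an arbitrary choice for this example):

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic data: 500 observations (rows) of 3 correlated features.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # sort descending

# Route 2: SVD of the centered data itself (no covariance matrix needed).
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s ** 2 / (n - 1)
print(np.round(eigvals, 4), np.round(svd_eigvals, 4))
```

The SVD route avoids forming the covariance matrix explicitly, which tends to be preferable when the number of features is large.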

4) Project the data

To reduce the dimensionality, the data is simply projected onto the largest eigenvectors. Let V be the matrix whose columns contain the largest eigenvectors and let D be the original data whose columns contain the different observations. Then the projected data D' is obtained as D' = V^{\intercal} \, D. We can either choose the number of remaining dimensions, i.e. the columns of V, directly, or we can define the amount of variance of the original data that needs to be kept while eliminating eigenvectors. If only N eigenvectors are kept, and e_1...e_N represent the corresponding eigenvalues, then the amount of variance that remains after projecting the original d-dimensional data can be calculated as:

(4)   \begin{equation*} s = \frac{\sum_{i=1}^N e_i}{\sum_{j=1}^d e_j} \end{equation*}
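As a small worked example of this ratio, the sketch below applies it to the 2×2 covariance matrix of the first section. The helper names `retained_variance` and `components_for_variance` are ours and only illustrate one possible way to pick the number of eigenvectors from a variance target.

```python
import numpy as np

def retained_variance(eigenvalues):
    """Fraction of total variance kept by the 1, 2, ..., d largest eigenvalues."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return np.cumsum(ordered) / ordered.sum()

def components_for_variance(eigenvalues, target):
    """Smallest number of eigenvectors that keeps at least `target` of the variance."""
    retained = retained_variance(eigenvalues)
    k = int(np.searchsorted(retained, target)) + 1
    return min(k, len(retained))

# Covariance matrix of the correlated 2D example from the first section.
Sigma = np.array([[16.87, 14.94],
                  [14.94, 17.27]])
eigenvalues = np.linalg.eigvalsh(Sigma)

retained = retained_variance(eigenvalues)
k90 = components_for_variance(eigenvalues, 0.90)
k95 = components_for_variance(eigenvalues, 0.95)
print(np.round(retained, 4), k90, k95)
```

For this matrix, the largest eigenvector alone keeps roughly 94% of the variance, so one component suffices for a 90% target but not for a 95% target.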

PCA pitfalls

In the above discussion, several assumptions have been made. In the first section, we discussed how PCA decorrelates the data. In fact, we started the discussion by expressing our desire to recover the unknown, underlying independent components of the observed features. We then assumed that our data was normally distributed, such that statistical independence simply corresponds to the lack of a linear correlation. Indeed, PCA allows us to decorrelate the data, thereby recovering the independent components in case of Gaussianity. However, it is important to note that decorrelation only corresponds to statistical independency in the Gaussian case. Consider the data obtained by sampling half a period of y=sin(x):

Figure 11. Uncorrelated data is only statistically independent if normally distributed. In this example a clear non-linear dependency still exists: y=sin(x).

Although the above data is clearly uncorrelated (on average, the y-value increases as much as it decreases when the x-value goes up) and therefore corresponds to a diagonal covariance matrix, there still is a clear non-linear dependency between both variables.
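This is straightforward to confirm numerically. The sketch below samples half a period of the sine on a regular grid and computes the linear correlation between x and y, as well as the correlation between y and a quadratic function of x (the quadratic is our own choice of a non-linear probe).

```python
import numpy as np

# Half a period of y = sin(x), sampled on a regular grid.
x = np.linspace(0.0, np.pi, 1001)
y = np.sin(x)

# Linear correlation: zero, by symmetry around x = pi/2.
linear_corr = np.corrcoef(x, y)[0, 1]

# The dependency shows up as soon as a non-linear function of x is used.
nonlinear_corr = np.corrcoef((x - np.pi / 2) ** 2, y)[0, 1]
print(linear_corr, nonlinear_corr)
```

The first value is zero up to floating point error, while the second is close to -1: y is almost perfectly predictable from x, just not linearly.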

In general, PCA only decorrelates the data but does not remove statistical dependencies. If the underlying components are known to be non-Gaussian, techniques such as ICA could be more interesting. On the other hand, if non-linearities clearly exist, dimensionality reduction techniques such as non-linear PCA can be used. However, keep in mind that these methods are prone to overfitting themselves, since more parameters are to be estimated based on the same amount of training data.

A second assumption that was made in this article, is that the most discriminative information is captured by the largest variance in the feature space. Since the direction of the largest variance encodes the most information this is likely to be true. However, there are cases where the discriminative information actually resides in the directions of the smallest variance, such that PCA could greatly hurt classification performance. As an example, consider the two cases of figure 12, where we reduce the 2D feature space to a 1D representation:

Figure 12. In the first case, PCA would hurt classification performance because the data becomes linearly inseparable. This happens when the most discriminative information resides in the smaller eigenvectors.

If the most discriminative information is contained in the smaller eigenvectors, applying PCA might actually worsen the Curse of Dimensionality because now a more complicated classification model (e.g. non-linear classifier) is needed to classify the lower dimensional problem. In this case, other dimensionality reduction methods might be of interest, such as Linear Discriminant Analysis (LDA) which tries to find the projection vector that optimally separates the two classes.
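The situation of figure 12 can be imitated with two synthetic classes that share a large spread along x but are separated along y, where the variance is small. The sketch below compares the 1D projection direction chosen by PCA with the one chosen by a plain two-class Fisher LDA; all numbers are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
# Two classes: large shared spread along x, small separation along y.
class0 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(-1.0, 0.2, n)])
class1 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(1.0, 0.2, n)])
X = np.vstack([class0, class1])

# PCA: direction of largest variance, ignoring the class labels.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, -1]

# Fisher LDA for two classes: w is proportional to Sw^-1 (m1 - m0).
m0, m1 = class0.mean(axis=0), class1.mean(axis=0)
Sw = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
lda_dir = np.linalg.solve(Sw, m1 - m0)
lda_dir = lda_dir / np.linalg.norm(lda_dir)

def separation(direction):
    """Distance between projected class means, in units of the pooled std."""
    p0, p1 = class0 @ direction, class1 @ direction
    return abs(p1.mean() - p0.mean()) / np.sqrt(0.5 * (p0.var() + p1.var()))

print(separation(pca_dir), separation(lda_dir))
```

PCA picks the x-axis, along which the classes overlap completely, whereas LDA picks the y-axis, along which they are well separated.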

Source Code

The following code snippet shows how to perform principal component analysis for dimensionality reduction in Matlab:
Matlab source code
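The Matlab listing itself is not part of this text. As a stand-in, the recipe above can be sketched in Python with NumPy. This is a minimal illustration under our own naming (the function `pca` and its parameters are not taken from the Matlab code), using the same convention as the recipe: each column of the data matrix is one observation.

```python
import numpy as np

def pca(data, n_components, normalize=True):
    """Project `data` (one observation per column) onto its largest eigenvectors."""
    D = np.asarray(data, dtype=float)
    D = D - D.mean(axis=1, keepdims=True)               # 1) center the data
    if normalize:
        D = D / D.std(axis=1, ddof=1, keepdims=True)    # 2) normalize the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(D))        # 3) eigendecomposition
    order = np.argsort(eigvals)[::-1]                   # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]
    projected = V.T @ D                                 # 4) project: D' = V^T D
    kept = eigvals[:n_components].sum() / eigvals.sum() # retained variance
    return projected, V, kept

# Synthetic example: 5 observed features driven by 2 hidden components.
rng = np.random.default_rng(7)
latent = rng.normal(size=(2, 300))
mixing = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
data = mixing @ latent + 0.05 * rng.normal(size=(5, 300))

projected, V, kept = pca(data, n_components=2)
print(projected.shape, round(kept, 4))
```

Because the five features are driven by only two hidden components, two eigenvectors keep nearly all of the variance in this example.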

Conclusion

In this article, we discussed the advantages of PCA for feature extraction and dimensionality reduction from two different points of view. The first point of view explained how PCA allows us to decorrelate the feature space, whereas the second point of view showed that PCA actually corresponds to orthogonal regression.

Furthermore, we briefly introduced Eigenfaces as a well known example of PCA based feature extraction, and we covered some of the most important disadvantages of Principal Component Analysis.

If you’re new to this blog, don’t forget to subscribe, or follow me on twitter!

A geometric interpretation of the covariance matrix https://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/ https://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/#comments Thu, 24 Apr 2014 11:09:38 +0000 http://www.visiondummy.com/?p=440 In this article, we provide an intuitive, geometric interpretation of the covariance matrix, by exploring the relation between linear transformations and the resulting data covariance. Most textbooks explain the shape of data based on the concept of covariance matrices. Instead, we take a backwards approach and explain the concept of covariance matrices based on the [...]

The post A geometric interpretation of the covariance matrix appeared first on Computer vision for dummies.

Introduction

In this article, we provide an intuitive, geometric interpretation of the covariance matrix, by exploring the relation between linear transformations and the resulting data covariance. Most textbooks explain the shape of data based on the concept of covariance matrices. Instead, we take a backwards approach and explain the concept of covariance matrices based on the shape of data.


In a previous article, we discussed the concept of variance, and provided a derivation and proof of the well known formula to estimate the sample variance. Figure 1 was used in this article to show that the standard deviation, as the square root of the variance, provides a measure of how much the data is spread across the feature space.

Normal distribution

Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall within the interval defined by the mean plus and minus the standard deviation.

We showed that an unbiased estimator of the sample variance can be obtained by:

(1)   \begin{align*} \sigma_x^2 &= \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu)^2\\ &= \mathbb{E}[ (x - \mathbb{E}(x)) (x - \mathbb{E}(x))]\\ &= \sigma(x,x) \end{align*}

However, variance can only be used to explain the spread of the data in the directions parallel to the axes of the feature space. Consider the 2D feature space shown by figure 2:

Data with a positive covariance

Figure 2. The diagonal spread of the data is captured by the covariance.

For this data, we could calculate the variance \sigma(x,x) in the x-direction and the variance \sigma(y,y) in the y-direction. However, the horizontal spread and the vertical spread of the data do not explain the clear diagonal correlation. Figure 2 clearly shows that on average, if the x-value of a data point increases, then the y-value also increases, resulting in a positive correlation. This correlation can be captured by extending the notion of variance to what is called the ‘covariance’ of the data:

(2)   \begin{equation*} \sigma(x,y) = \mathbb{E}[ (x - \mathbb{E}(x)) (y - \mathbb{E}(y))] \end{equation*}

For 2D data, we thus obtain \sigma(x,x), \sigma(y,y), \sigma(x,y) and \sigma(y,x). These four values can be summarized in a matrix, called the covariance matrix:

(3)   \begin{equation*} \Sigma = \begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\[0.3em] \sigma(y,x) & \sigma(y,y) \\[0.3em] \end{bmatrix} \end{equation*}

If x is positively correlated with y, y is also positively correlated with x. In other words, we can state that \sigma(x,y) = \sigma(y,x). Therefore, the covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal. Two-dimensional normally distributed data is explained completely by its mean and its 2\times 2 covariance matrix. Similarly, a 3 \times 3 covariance matrix is used to capture the spread of three-dimensional data, and an N \times N covariance matrix captures the spread of N-dimensional data.
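As a minimal sketch of equations (1) to (3), the following Python snippet estimates the 2×2 covariance matrix of a small data set using only the standard library. The function names and the sample values are illustrative assumptions and do not come from the original article.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Unbiased sample covariance sigma(a, b); covariance(a, a) is the variance."""
    mu_a, mu_b = mean(a), mean(b)
    return sum((ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def covariance_matrix(x, y):
    """2x2 covariance matrix [[s(x,x), s(x,y)], [s(y,x), s(y,y)]]."""
    return [[covariance(x, x), covariance(x, y)],
            [covariance(y, x), covariance(y, y)]]

# Hypothetical toy data in which y tends to grow with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

sigma = covariance_matrix(x, y)
print(sigma)
```

The off-diagonal entries are identical, which reflects the symmetry of the covariance matrix, and their positive sign reflects the positive correlation in the toy data.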

Figure 3 illustrates how the overall shape of the data defines the covariance matrix:

The spread of the data is defined by its covariance matrix

Figure 3. The covariance matrix defines the shape of the data. Diagonal spread is captured by the covariance, while axis-aligned spread is captured by the variance.

Eigendecomposition of a covariance matrix

In the next section, we will discuss how the covariance matrix can be interpreted as a linear operator that transforms white data into the data we observed. However, before diving into the technical details, it is important to gain an intuitive understanding of how eigenvectors and eigenvalues uniquely define the covariance matrix, and therefore the shape of our data.

As we saw in figure 3, the covariance matrix defines both the spread (variance), and the orientation (covariance) of our data. So, if we would like to represent the covariance matrix with a vector and its magnitude, we should simply try to find the vector that points into the direction of the largest spread of the data, and whose magnitude equals the spread (variance) in this direction.

If we define this vector as \vec{v}, then the projection of our data D onto this vector is obtained as \vec{v}^{\intercal} D, and the variance of the projected data is \vec{v}^{\intercal} \Sigma \vec{v}. Since we are looking for the vector \vec{v} that points into the direction of the largest variance, we should choose its components such that the variance \vec{v}^{\intercal} \Sigma \vec{v} of the projected data is as large as possible. Maximizing any function of the form \vec{v}^{\intercal} \Sigma \vec{v} with respect to \vec{v}, where \vec{v} is a normalized unit vector, can be formulated as a so-called Rayleigh Quotient. The maximum of such a Rayleigh Quotient is obtained by setting \vec{v} equal to the largest eigenvector of matrix \Sigma, i.e. the eigenvector with the largest corresponding eigenvalue.

In other words, the largest eigenvector of the covariance matrix always points into the direction of the largest variance of the data, and the magnitude of this vector equals the corresponding eigenvalue. The second largest eigenvector is always orthogonal to the largest eigenvector, and points into the direction of the second largest spread of the data.
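The Rayleigh Quotient argument can be checked numerically. The sketch below scans unit vectors over half a circle, evaluates the projected variance \vec{v}^{\intercal} \Sigma \vec{v} for each of them, and compares the maximum with the largest eigenvalue obtained from the closed-form solution for symmetric 2×2 matrices. The covariance matrix and the step size of 0.1 degrees are assumptions chosen for illustration.

```python
import math

def projected_variance(sigma, angle):
    """Variance v^T * Sigma * v of the data projected onto the unit vector v."""
    vx, vy = math.cos(angle), math.sin(angle)
    return (vx * (sigma[0][0] * vx + sigma[0][1] * vy)
            + vy * (sigma[1][0] * vx + sigma[1][1] * vy))

def largest_eigenvalue(sigma):
    """Closed-form largest eigenvalue of a symmetric 2x2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    return (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)

# Hypothetical covariance matrix with a positive covariance.
sigma = [[5.0, 3.0],
         [3.0, 5.0]]

# Brute-force search over unit vectors between 0 and 180 degrees.
angles = [math.radians(k / 10) for k in range(1800)]
best_angle = max(angles, key=lambda t: projected_variance(sigma, t))

print("direction of largest spread (degrees):", math.degrees(best_angle))
print("variance in that direction:", projected_variance(sigma, best_angle))
print("largest eigenvalue:", largest_eigenvalue(sigma))
```

For this matrix the search ends up on the diagonal of the feature space, and the variance found there matches the largest eigenvalue up to the resolution of the scan.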

Now let’s have a look at some examples. In an earlier article we saw that a linear transformation matrix T is completely defined by its eigenvectors and eigenvalues. Applied to the covariance matrix, this means that:

(4)   \begin{equation*}  \Sigma \vec{v} = \lambda \vec{v} \end{equation*}

where \vec{v} is an eigenvector of \Sigma, and \lambda is the corresponding eigenvalue.

If the covariance matrix of our data is a diagonal matrix, such that the covariances are zero, then this means that the variances must be equal to the eigenvalues \lambda. This is illustrated by figure 4, where the eigenvectors are shown in green and magenta, and where the eigenvalues clearly equal the variance components of the covariance matrix.

Eigenvectors of a covariance matrix

Figure 4. Eigenvectors of a covariance matrix

However, if the covariance matrix is not diagonal, such that the covariances are not zero, then the situation is a little more complicated. The eigenvalues still represent the variance magnitude in the direction of the largest spread of the data, and the variance components of the covariance matrix still represent the variance magnitude in the direction of the x-axis and y-axis. But since the data is not axis aligned, these values are not the same anymore as shown by figure 5.

Eigenvectors with covariance

Figure 5. Eigenvalues versus variance

By comparing figure 5 with figure 4, it becomes clear that the eigenvalues represent the variance of the data along the eigenvector directions, whereas the variance components of the covariance matrix represent the spread along the axes. If there are no covariances, then both values are equal.
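To make the difference between eigenvalues and variances concrete, the following sketch compares two covariance matrices that we made up for this purpose: a diagonal one, and one that describes the same spread rotated by 45 degrees. The helper `eigenvalues_2x2` is a hypothetical name; it uses the closed-form solution for symmetric 2×2 matrices.

```python
import math

def eigenvalues_2x2(sigma):
    """Eigenvalues (largest first) of a symmetric 2x2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    centre = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + radius, centre - radius

diagonal = [[16.0, 0.0], [0.0, 1.0]]   # axis-aligned data
rotated = [[8.5, 7.5], [7.5, 8.5]]     # the same spread, rotated by 45 degrees

for name, sigma in (("diagonal", diagonal), ("rotated", rotated)):
    variances = (sigma[0][0], sigma[1][1])
    print(name, "variances:", variances, "eigenvalues:", eigenvalues_2x2(sigma))
```

Both matrices have the same eigenvalues, but only for the diagonal matrix do these coincide with the variances along the x-axis and y-axis.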

Covariance matrix as a linear transformation

Now let’s forget about covariance matrices for a moment. Each of the examples in figure 3 can simply be considered to be a linearly transformed instance of figure 6:

White data

Figure 6. Data with unit covariance matrix is called white data.

Let the data shown by figure 6 be D, then each of the examples shown by figure 3 can be obtained by linearly transforming D:

(5)   \begin{equation*} D' = T \, D \end{equation*}

where T is a transformation matrix consisting of a rotation matrix R and a scaling matrix S:

(6)   \begin{equation*} T = R \, S. \end{equation*}

These matrices are defined as:

(7)   \begin{equation*} R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\[0.3em] \sin(\theta) & \cos(\theta) \end{bmatrix} \end{equation*}

where \theta is the rotation angle, and:

(8)   \begin{equation*} S = \begin{bmatrix} s_x & 0 \\[0.3em] 0 & s_y \end{bmatrix} \end{equation*}

where s_x and s_y are the scaling factors in the x direction and the y direction respectively.

In the following paragraphs, we will discuss the relation between the covariance matrix \Sigma, and the linear transformation matrix T = R\, S.

Let’s start with unscaled (scale equals 1) and unrotated data. In statistics this is often referred to as ‘white data’ because its samples are drawn from a standard normal distribution and therefore correspond to white (uncorrelated) noise:

Whitened data

Figure 7. White data is data with a unit covariance matrix.

The covariance matrix of this ‘white’ data equals the identity matrix, such that the variances and standard deviations equal 1 and the covariance equals zero:

(9)   \begin{equation*} \Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\[0.3em] 0 & \sigma_y^2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 \\[0.3em] 0 & 1 \\ \end{bmatrix} \end{equation*}

Now let’s scale the data in the x-direction with a factor 4:

(10)   \begin{equation*} D' = \begin{bmatrix} 4 & 0 \\[0.3em] 0 & 1 \\ \end{bmatrix} \, D \end{equation*}

The data D' now looks as follows:

Data with variance in the x-direction

Figure 8. Variance in the x-direction results in a horizontal scaling.

The covariance matrix \Sigma' of D' is now:

(11)   \begin{equation*} \Sigma' = \begin{bmatrix} \sigma_x^2 & 0 \\[0.3em] 0 & \sigma_y^2 \\ \end{bmatrix} = \begin{bmatrix} 16 & 0 \\[0.3em] 0 & 1 \\ \end{bmatrix} \end{equation*}

Thus, the covariance matrix \Sigma' of the resulting data D' is related to the linear transformation T that is applied to the original data as follows: D' = T \, D, where

(12)   \begin{equation*} T = \sqrt{\Sigma'} = \begin{bmatrix} 4 & 0 \\[0.3em] 0 & 1 \\ \end{bmatrix}. \end{equation*}

However, although equation (12) holds when the data is scaled in the x and y direction, the question arises whether it also holds when a rotation is applied. To investigate the relation between the linear transformation matrix T and the covariance matrix \Sigma' in the general case, we will therefore try to decompose the covariance matrix into the product of rotation and scaling matrices.

As we saw earlier, we can represent the covariance matrix by its eigenvectors and eigenvalues:

(13)   \begin{equation*}  \Sigma \vec{v} = \lambda \vec{v} \end{equation*}

where \vec{v} is an eigenvector of \Sigma, and \lambda is the corresponding eigenvalue.

Equation (13) holds for each eigenvector-eigenvalue pair of matrix \Sigma. In the 2D case, we obtain two eigenvectors and two eigenvalues. The system of two equations defined by equation (13) can be represented efficiently using matrix notation:

(14)   \begin{equation*}  \Sigma \, V = V \, L \end{equation*}

where V is the matrix whose columns are the eigenvectors of \Sigma and L is the diagonal matrix whose non-zero elements are the corresponding eigenvalues.

This means that we can represent the covariance matrix as a function of its eigenvectors and eigenvalues:

(15)   \begin{equation*}  \Sigma = V \, L \, V^{-1} \end{equation*}

Equation (15) is called the eigendecomposition of the covariance matrix and can be obtained using a Singular Value Decomposition algorithm. Whereas the eigenvectors represent the directions of the largest variance of the data, the eigenvalues represent the magnitude of this variance in those directions. In other words, V represents a rotation matrix, while \sqrt{L} represents a scaling matrix. The covariance matrix can thus be decomposed further as:

(16)   \begin{equation*}  \Sigma = R \, S \, S \, R^{-1} \end{equation*}

where R=V is a rotation matrix and S=\sqrt{L} is a scaling matrix.
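The decomposition of equation (16) can be sketched in a few lines of Python for the 2D case. The snippet below is not the author's implementation: it uses the closed-form eigendecomposition of a symmetric 2×2 matrix instead of a general SVD routine, it relies on the fact that the inverse of a rotation matrix is its transpose, and the example matrix and the function names are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def rotation_and_scaling(sigma):
    """Return (R, S) such that Sigma = R * S * S * R^T for a symmetric 2x2 Sigma."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    centre = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    # Clamp at zero to guard against tiny negative values caused by rounding.
    lam_large, lam_small = centre + radius, max(centre - radius, 0.0)
    # Orientation of the eigenvector that belongs to the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - d)
    R = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    S = [[math.sqrt(lam_large), 0.0],
         [0.0, math.sqrt(lam_small)]]
    return R, S

# Hypothetical covariance matrix of correlated 2D data.
sigma = [[8.5, 7.5],
         [7.5, 8.5]]

R, S = rotation_and_scaling(sigma)
rebuilt = matmul(matmul(matmul(R, S), S), transpose(R))

print("scaling factors:", S[0][0], S[1][1])
print("rebuilt covariance matrix:", rebuilt)
```

Multiplying the factors back together reproduces the original covariance matrix up to rounding errors.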

In equation (6) we defined a linear transformation T=R \, S. Since S is a diagonal scaling matrix, S = S^{\intercal}. Furthermore, since R is an orthogonal matrix, R^{-1} = R^{\intercal}. Therefore, T^{\intercal} = (R \, S)^{\intercal} = S^{\intercal} \, R^{\intercal} = S \, R^{-1}. The covariance matrix can thus be written as:

(17)   \begin{equation*}  \Sigma = R \, S \, S \, R^{-1} = T \, T^{\intercal}. \end{equation*}

In other words, if we apply the linear transformation defined by T=R \, S to the original white data D shown by figure 7, we obtain the rotated and scaled data D' with covariance matrix T \, T^{\intercal} = \Sigma' = R \, S \, S \, R^{-1}. This is illustrated by figure 10:

The covariance matrix represents a linear transformation of the original data

Figure 10. The covariance matrix represents a linear transformation of the original data.

The colored arrows in figure 10 represent the eigenvectors. The largest eigenvector, i.e. the eigenvector with the largest corresponding eigenvalue, always points in the direction of the largest variance of the data and thereby defines its orientation. Subsequent eigenvectors are always orthogonal to the largest eigenvector due to the orthogonality of rotation matrices.
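The relation between T and the covariance matrix of the transformed data can also be checked empirically. In the sketch below we draw white data from a standard normal distribution, apply a transformation T = R S, and compare the estimated covariance matrix with T T^{\intercal}. The rotation angle, the scaling factors, the sample count and the random seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so that the run is repeatable

# White data D: independent standard normal x and y coordinates.
n = 20000
white = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# T = R * S with an assumed rotation of 30 degrees and scaling factors 4 and 1.
theta, sx, sy = math.radians(30), 4.0, 1.0
T = [[math.cos(theta) * sx, -math.sin(theta) * sy],
     [math.sin(theta) * sx, math.cos(theta) * sy]]

# D' = T * D, applied point by point.
data = [(T[0][0] * x + T[0][1] * y, T[1][0] * x + T[1][1] * y) for x, y in white]

def covariance(a, b):
    """Unbiased sample covariance of two equally long sequences."""
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / (len(a) - 1)

xs, ys = zip(*data)
estimated = [[covariance(xs, xs), covariance(xs, ys)],
             [covariance(ys, xs), covariance(ys, ys)]]

# Theoretical covariance matrix T * T^T.
expected = [[sum(T[i][k] * T[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

print("estimated:", estimated)
print("expected: ", expected)
```

Because the covariance is estimated from a finite sample, the two matrices agree only approximately; the agreement improves as the number of samples grows.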

Conclusion

In this article we showed that the covariance matrix of observed data is directly related to a linear transformation of white, uncorrelated data. This linear transformation is completely defined by the eigenvectors and eigenvalues of the covariance matrix of the data. While the eigenvectors represent the rotation matrix, the eigenvalues correspond to the square of the scaling factor in each dimension.

If you’re new to this blog, don’t forget to subscribe, or follow me on twitter!

The post A geometric interpretation of the covariance matrix appeared first on Computer vision for dummies.

]]>
https://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/feed/ 47
The Curse of Dimensionality in classification https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/ https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/#comments Wed, 16 Apr 2014 15:33:41 +0000 http://www.visiondummy.com/?p=332 In this article, we will discuss the so called ‘Curse of Dimensionality’, and explain why it is important when designing a classifier. In the following sections I will provide an intuitive explanation of this concept, illustrated by a clear example of overfitting due to the curse of dimensionality. Consider an example in which we have [...]

The post The Curse of Dimensionality in classification appeared first on Computer vision for dummies.

]]>
Introduction

In this article, we will discuss the so called ‘Curse of Dimensionality’, and explain why it is important when designing a classifier. In the following sections I will provide an intuitive explanation of this concept, illustrated by a clear example of overfitting due to the curse of dimensionality.

Consider an example in which we have a set of images, each of which depicts either a cat or a dog. We would like to create a classifier that is able to distinguish dogs from cats automatically. To do so, we first need to think about a descriptor for each object class that can be expressed by numbers, such that a mathematical algorithm, i.e. a classifier, can use these numbers to recognize the object. We could for instance argue that cats and dogs generally differ in color. A possible descriptor that discriminates these two classes could then consist of three numbers: the average red color, the average green color and the average blue color of the image under consideration. A simple linear classifier, for instance, could combine these features linearly to decide on the class label:

If 0.5*red + 0.3*green + 0.2*blue > 0.6 : return cat;
else return dog;
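The rule above can be written as a small runnable function. This is only a sketch: the weights and the threshold are taken from the pseudocode, while the assumption that the average color values are scaled to the range [0, 1] and the example inputs are ours.

```python
def classify(red, green, blue):
    """Toy linear classifier on average color values (assumed to lie in [0, 1])."""
    score = 0.5 * red + 0.3 * green + 0.2 * blue
    return "cat" if score > 0.6 else "dog"

# Hypothetical average colors of two images.
print(classify(0.9, 0.8, 0.7))
print(classify(0.2, 0.3, 0.4))
```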

However, these three color-describing numbers, called features, will obviously not suffice to obtain a perfect classification. Therefore, we could decide to add some features that describe the texture of the image, for instance by calculating the average edge or gradient intensity in both the X and Y direction. We now have 5 features that, in combination, could possibly be used by a classification algorithm to distinguish cats from dogs.

To obtain an even more accurate classification, we could add more features, based on color or texture histograms, statistical moments, etc. Maybe we can obtain a perfect classification by carefully defining a few hundred of these features? The answer to this question might sound a bit counter-intuitive: no, we cannot! In fact, after a certain point, increasing the dimensionality of the problem by adding new features would actually degrade the performance of our classifier. This is illustrated by figure 1, and is often referred to as ‘The Curse of Dimensionality’.

Feature dimensionality versus classifier performance

Figure 1. As the dimensionality increases, the classifier’s performance increases until the optimal number of features is reached. Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.

In the next sections we will review why the above is true, and how the curse of dimensionality can be avoided.

The curse of dimensionality and overfitting

In the earlier introduced example of cats and dogs, let’s assume there are an infinite number of cats and dogs living on our planet. However, due to our limited time and processing power, we were only able to obtain 10 pictures of cats and dogs. The end-goal in classification is then to train a classifier based on these 10 training instances, that is able to correctly classify the infinite number of dog and cat instances which we do not have any information about.

Now let’s use a simple linear classifier and try to obtain a perfect classification. We can start with a single feature, e.g. the average ‘red’ color in the image:

A 1D classification problem

Figure 2. A single feature does not result in a perfect separation of our training data.

Figure 2 shows that we do not obtain a perfect classification result if only a single feature is used. Therefore, we might decide to add another feature, e.g. the average ‘green’ color in the image:

2D classification problem

Figure 3. Adding a second feature still does not result in a linearly separable classification problem: No single line can separate all cats from all dogs in this example.

Finally we decide to add a third feature, e.g. the average ‘blue’ color in the image, yielding a three-dimensional feature space:

3D classification problem

Figure 4. Adding a third feature results in a linearly separable classification problem in our example. A plane exists that perfectly separates dogs from cats.

In the three-dimensional feature space, we can now find a plane that perfectly separates dogs from cats. This means that a linear combination of the three features can be used to obtain perfect classification results on our training data of 10 images:

Linearly separable classification problem

Figure 5. The more features we use, the higher the likelihood that we can successfully separate the classes perfectly.

The above illustrations might seem to suggest that increasing the number of features until perfect classification results are obtained is the best way to train a classifier, whereas in the introduction, illustrated by figure 1, we argued that this is not the case. However, note how the density of the training samples decreased exponentially when we increased the dimensionality of the problem.

In the 1D case (figure 2), 10 training instances covered the complete 1D feature space, the width of which was 5 unit intervals. Therefore, in the 1D case, the sample density was 10/5 = 2 samples per unit interval. In the 2D case however (figure 3), we still had 10 training instances at our disposal, which now cover a 2D feature space with an area of 5×5 = 25 unit squares. Therefore, in the 2D case, the sample density was 10/25 = 0.4 samples per unit square. Finally, in the 3D case, the 10 samples had to cover a feature space volume of 5×5×5 = 125 unit cubes. Therefore, in the 3D case, the sample density was 10/125 = 0.08 samples per unit cube.
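The arithmetic above is easy to reproduce. The following sketch recomputes the sample density for one, two and three features, using the 10 training samples and the axis width of 5 unit intervals from the example; the variable names are our own.

```python
n_samples = 10   # training images in the example
width = 5        # unit intervals along each feature axis

densities = {}
for d in (1, 2, 3):
    volume = width ** d   # number of unit cells in a d-dimensional feature space
    densities[d] = n_samples / volume
    print(f"{d}D: {n_samples}/{volume} = {densities[d]} samples per unit cell")
```

Each additional feature divides the density by the axis width, which is the exponential decrease mentioned earlier.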

If we keep adding features, the dimensionality of the feature space grows, and the space becomes sparser and sparser. Due to this sparsity, it becomes much easier to find a separating hyperplane because the likelihood that a training sample lies on the wrong side of the best hyperplane becomes infinitely small when the number of features becomes infinitely large. However, if we project the highly dimensional classification result back to a lower dimensional space, a serious problem associated with this approach becomes evident:

Overfitting

Figure 6. Using too many features results in overfitting. The classifier starts learning exceptions that are specific to the training data and do not generalize well when new data is encountered.

Figure 6 shows the 3D classification results, projected onto a 2D feature space. Whereas the data was linearly separable in the 3D space, this is not the case in a lower dimensional feature space. In fact, adding the third dimension to obtain perfect classification results simply corresponds to using a complicated non-linear classifier in the lower dimensional feature space. As a result, the classifier learns the appearance of specific instances and exceptions of our training dataset. Because of this, the resulting classifier would fail on real-world data, consisting of an infinite number of unseen cats and dogs that often do not adhere to these exceptions.

This concept is called overfitting and is a direct result of the curse of dimensionality. Figure 7 shows the result of a linear classifier that has been trained using only 2 features instead of 3:

Linear classifier

Figure 7. Although the training data is not classified perfectly, this classifier achieves better results on unseen data than the one from figure 5.

Although the simple linear classifier with decision boundaries shown by figure 7 seems to perform worse than the non-linear classifier in figure 5, this simple classifier generalizes much better to unseen data because it did not learn specific exceptions that were only in our training data by coincidence. In other words, by using fewer features, the curse of dimensionality was avoided such that the classifier did not overfit the training data.

Figure 8 illustrates the above in a different manner. Let’s say we want to train a classifier using only a single feature whose value ranges from 0 to 1. Let’s assume that this feature is unique for each cat and dog. If we want our training data to cover 20% of this range, then the amount of training data needed is 20% of the complete population of cats and dogs. Now, if we add another feature, resulting in a 2D feature space, things change: to cover 20% of the 2D feature range, we now need to obtain 45% of the complete population of cats and dogs in each dimension (0.45^2 ≈ 0.2). In the 3D case this gets even worse: to cover 20% of the 3D feature range, we need to obtain 58% of the population in each dimension (0.58^3 ≈ 0.2).

The amount of training data grows exponentially with the number of dimensions

Figure 8. The amount of training data needed to cover 20% of the feature range grows exponentially with the number of dimensions.
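The percentages in this example follow from taking the d-th root of the desired coverage. A minimal sketch, in which the 20% coverage comes from the text and the 10-dimensional case is added by us for comparison:

```python
coverage = 0.2  # fraction of the feature space the training data should cover

# Fraction of the range of each individual feature that has to be covered.
fractions = {d: coverage ** (1 / d) for d in (1, 2, 3, 10)}

for d, fraction in fractions.items():
    print(f"{d} feature(s): {fraction:.0%} of the range of each feature")
```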

In other words, if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.

In the above example, we showed that the curse of dimensionality introduces sparseness of the training data. The more features we use, the more sparse the data becomes such that accurate estimation of the classifier’s parameters (i.e. its decision boundaries) becomes more difficult. Another effect of the curse of dimensionality, is that this sparseness is not uniformly distributed over the search space. In fact, data around the origin (at the center of the hypercube) is much more sparse than data in the corners of the search space. This can be understood as follows:

Imagine a unit square that represents the 2D feature space. The average of the feature space is the center of this unit square, and all points within a distance of 0.5 from this center lie inside the circle that is inscribed in the unit square. The training samples that do not fall within this inscribed circle are closer to the corners of the search space than to its center. These samples are difficult to classify because their feature values differ greatly (e.g. samples in opposite corners of the unit square). Therefore, classification is easier if most samples fall inside the inscribed circle, illustrated by figure 9:

Features within a distance of 0.5 from their average fall inside the inscribed circle

Figure 9. Training samples that fall outside the inscribed circle are in the corners of the feature space and are more difficult to classify than samples near the center of the feature space.

An interesting question is now how the volume of the circle (hypersphere) changes relative to the volume of the square (hypercube) when we increase the dimensionality of the feature space. The volume of a unit hypercube of dimension d is always 1^d = 1. The volume of the inscribed hypersphere of dimension d and with radius 0.5 can be calculated as:

(1)   \begin{equation*} V(d) = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}0.5^d. \end{equation*}

Figure 10 shows how the volume of this hypersphere changes when the dimensionality increases:

The volume of the hypersphere tends to zero as the dimensionality increases

Figure 10. The volume of the hypersphere tends to zero as the dimensionality increases.
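Equation (1) can be evaluated directly with the gamma function from Python's standard library. The sketch below prints the volume of the inscribed hypersphere for a few dimensions that we picked for illustration; since the unit hypercube has volume 1, one minus this value is the share of the hypercube that lies outside the hypersphere.

```python
import math

def inscribed_sphere_volume(d, radius=0.5):
    """Volume of a d-dimensional hypersphere, following equation (1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d

for d in (1, 2, 3, 8, 20):
    volume = inscribed_sphere_volume(d)
    print(f"d = {d:2d}: sphere volume = {volume:.6f}, outside the sphere = {1 - volume:.6f}")
```

For d = 2 this reduces to the familiar area of a circle with radius 0.5, and the volume shrinks quickly as d grows.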

This shows that the volume of the hypersphere tends to zero as the dimensionality tends to infinity, whereas the volume of the surrounding hypercube remains constant. This surprising and rather counter-intuitive observation partially explains the problems associated with the curse of dimensionality in classification: In high dimensional spaces, most of the training data resides in the corners of the hypercube defining the feature space. As mentioned before, instances in the corners of the feature space are much more difficult to classify than instances around the centroid of the hypersphere. This is illustrated by figure 11, which shows a 2D unit square, a 3D unit cube, and a creative visualization of an 8D hypercube which has 2^8 = 256 corners:

Highly dimensional feature spaces are sparse around their origin

Figure 11. As the dimensionality increases, a larger percentage of the training data resides in the corners of the feature space.

For an 8-dimensional hypercube, about 98% of uniformly distributed data is concentrated in its 256 corners, i.e. outside the inscribed hypersphere. As a result, when the dimensionality of the feature space goes to infinity, the ratio of the difference between the maximum and minimum Euclidean distance from a sample point to the centroid, and the minimum distance itself, tends to zero:

(2)   \begin{equation*} \lim_{d \to \infty} \frac{\operatorname{dist}_{\max} - \operatorname{dist}_{\min}}{\operatorname{dist}_{\min}} = 0 \end{equation*}

Therefore, distance measures start losing their effectiveness to measure dissimilarity in highly dimensional spaces. Since classifiers depend on these distance measures (e.g. Euclidean distance, Mahalanobis distance, Manhattan distance), classification is often easier in lower-dimensional spaces where fewer features are used to describe the object of interest. Similarly, Gaussian likelihoods become flat and heavy tailed distributions in high dimensional spaces, such that the ratio of the difference between the minimum and maximum likelihood and the minimum likelihood itself tends to zero.
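This loss of contrast between distances can be observed in a small simulation. The sketch below draws uniformly distributed points in the unit hypercube and computes the ratio of equation (2) with respect to the centroid. The number of points, the dimensionalities and the random seed are assumptions, and the exact values depend on the random draw.

```python
import math
import random

random.seed(1)  # fixed seed so that the run is repeatable

def relative_contrast(d, n_points=200):
    """(dist_max - dist_min) / dist_min for uniform points in the unit
    hypercube, with distances measured from the centroid (0.5, ..., 0.5)."""
    distances = []
    for _ in range(n_points):
        point = [random.random() for _ in range(d)]
        distances.append(math.sqrt(sum((p - 0.5) ** 2 for p in point)))
    return (max(distances) - min(distances)) / min(distances)

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, contrast in contrasts.items():
    print(f"d = {d:4d}: relative contrast = {contrast:.3f}")
```

In low dimensions the nearest and the farthest sample differ by a large factor, whereas in high dimensions all samples end up at almost the same distance from the centroid.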

How to avoid the curse of dimensionality?

Figure 1 showed that the performance of a classifier decreases when the dimensionality of the problem becomes too large. The question then is what ‘too large’ means, and how overfitting can be avoided. Regrettably there is no fixed rule that defines how many features should be used in a classification problem. In fact, this depends on the amount of training data available, the complexity of the decision boundaries, and the type of classifier used.

If a theoretically infinite number of training samples were available, the curse of dimensionality would not apply and we could simply use an infinite number of features to obtain perfect classification. The smaller the size of the training data, the fewer features should be used. If N training samples suffice to cover a 1D feature space of unit interval size, then N^2 samples are needed to cover a 2D feature space with the same density, and N^3 samples are needed in a 3D feature space. In other words, the number of training instances needed grows exponentially with the number of dimensions used.

Furthermore, classifiers that tend to model non-linear decision boundaries very accurately (e.g. neural networks, KNN classifiers, decision trees) do not generalize well and are prone to overfitting. Therefore, the dimensionality should be kept relatively low when these classifiers are used. If a classifier is used that generalizes easily (e.g. naive Bayesian, linear classifier), then the number of used features can be higher since the classifier itself is less expressive. Figure 6 showed that using a simple classifier model in a high dimensional space corresponds to using a complex classifier model in a lower dimensional space.

Therefore, overfitting occurs both when estimating relatively few parameters in a highly dimensional space, and when estimating a lot of parameters in a lower dimensional space. As an example, consider a Gaussian density function, parameterized by its mean and covariance matrix. Let’s say we operate in a 3D space, such that the covariance matrix is a 3×3 symmetric matrix consisting of 6 unique elements (3 variances on the diagonal and 3 covariances off-diagonal). Together with the 3D mean of the distribution this means that we need to estimate 9 parameters based on our training data, to obtain the Gaussian density that represents the likelihood of our data. In the 1D case, only 2 parameters need to be estimated (mean and variance), whereas in the 2D case 5 parameters are needed (2D mean, two variances and a covariance). Again we can see that the number of parameters to be estimated grows quadratically with the number of dimensions.
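The parameter counts mentioned above follow a simple pattern: a d-dimensional Gaussian has d mean values and d(d+1)/2 unique covariance entries. A short sketch, in which the function name and the larger example dimensions are our own:

```python
def gaussian_parameter_count(d):
    """Number of free parameters of a d-dimensional Gaussian density."""
    mean_params = d
    covariance_params = d * (d + 1) // 2   # unique entries of a symmetric d x d matrix
    return mean_params + covariance_params

for d in (1, 2, 3, 10, 100):
    print(f"{d}D Gaussian: {gaussian_parameter_count(d)} parameters")
```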

In an earlier article we showed that the variance of a parameter estimate increases if the number of parameters to be estimated increases (and if the bias of the estimate and the amount of training data are kept constant). This means that the quality of our parameter estimates decreases if the dimensionality goes up, due to the increase of variance. An increase of classifier variance corresponds to overfitting.

Another interesting question is which features should be used. Given a set of N features, how do we select an optimal subset of M features such that M<N? One approach would be to search for the optimum in the curve shown by figure 1. Since it is often intractable to train and test classifiers for all possible combinations of all features, several methods exist that try to find this optimum in different manners. These methods are called feature selection algorithms and often employ heuristics (greedy methods, best-first methods, etc.) to locate the optimal number and combination of features.
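As an illustration of such a greedy heuristic, the sketch below implements a simple forward selection: starting from an empty set, it repeatedly adds the feature that improves a score the most and stops as soon as no feature helps anymore. This is one possible heuristic among many, not a method prescribed by this article. The `score` callback stands in for something like the validation accuracy of a classifier trained on the given feature subset; the `toy_score` function and its numbers are entirely made up.

```python
def forward_selection(n_features, score, max_features):
    """Greedily add the feature that improves score(subset) the most."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        candidate_score, candidate = max((score(selected + [f]), f) for f in candidates)
        if candidate_score <= best_score:
            break  # none of the remaining features improves the score
        selected.append(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical usefulness of five features, with a small penalty per feature
# that mimics the loss of performance caused by a growing dimensionality.
USEFULNESS = [0.05, 0.30, 0.00, 0.20, 0.01]

def toy_score(subset):
    return 0.5 + sum(USEFULNESS[f] for f in subset) - 0.02 * len(subset)

selected, best = forward_selection(len(USEFULNESS), toy_score, max_features=5)
print("selected features:", selected, "score:", round(best, 2))
```

Greedy searches of this kind evaluate far fewer subsets than an exhaustive search, at the price of possibly missing the globally optimal combination.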

Another approach would be to replace the set of N features by a set of M features, each of which is a combination of the original feature values. Algorithms that try to find the optimal linear or non-linear combination of original features to reduce the dimensionality of the final problem are called Feature Extraction methods. A well known dimensionality reduction technique that yields uncorrelated, linear combinations of the original N features is Principal Component Analysis (PCA). PCA tries to find a linear subspace of lower dimensionality, such that the largest variance of the original data is kept. However, note that the largest variance of the data does not necessarily represent the most discriminative information.

Finally, an invaluable technique used to detect and avoid overfitting during classifier training is cross-validation. Cross-validation approaches split the original training data into two or more subsets. During classifier training, one subset is used to test the accuracy and precision of the resulting classifier, while the others are used for parameter estimation. If the classification results on the subsets used for training greatly differ from the results on the subset used for testing, overfitting is in play. Several types of cross-validation such as k-fold cross-validation and leave-one-out cross-validation can be used if only a limited amount of training data is available.
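The splitting step of k-fold cross-validation can be sketched as follows. The generator only produces index sets; training and evaluating a classifier on them is left out. The function name is hypothetical, and the samples are split in their original order, whereas in practice one would usually shuffle them first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute the samples as evenly as possible over the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Example: the 10 training images from this article, split into 5 folds.
splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Setting k equal to the number of samples turns this scheme into leave-one-out cross-validation.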

Conclusion

In this article we discussed the importance of feature selection, feature extraction, and cross-validation, in order to avoid overfitting due to the curse of dimensionality. Using a simple example, we reviewed an important effect of the curse of dimensionality in classifier training, namely overfitting.

If you’re new to this blog, don’t forget to subscribe, or follow me on twitter!


The post The Curse of Dimensionality in classification appeared first on Computer vision for dummies.

]]>
https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/feed/ 52
How to draw a covariance error ellipse? https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/ https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/#comments Thu, 03 Apr 2014 16:42:10 +0000 http://www.visiondummy.com/?p=288 In this post, I will show how to draw an error ellipse, a.k.a. confidence ellipse, for 2D normally distributed data. The error ellipse represents an iso-contour of the Gaussian distribution, and allows you to visualize a 2D confidence interval. The following figure shows a 95% confidence ellipse for a set of 2D normally distributed data [...]

The post How to draw a covariance error ellipse? appeared first on Computer vision for dummies.

]]>
Introduction

In this post, I will show how to draw an error ellipse, a.k.a. confidence ellipse, for 2D normally distributed data. The error ellipse represents an iso-contour of the Gaussian distribution, and allows you to visualize a 2D confidence interval. The following figure shows a 95% confidence ellipse for a set of 2D normally distributed data samples. This confidence ellipse defines the region that contains 95% of all samples that can be drawn from the underlying Gaussian distribution.

Error ellipse

Figure 1. 2D confidence ellipse for normally distributed data

In the next sections we will discuss how to obtain confidence ellipses for different confidence values (e.g. 99% confidence interval), and we will show how to plot these ellipses using Matlab or C++ code.

Axis-aligned confidence ellipses

Before deriving a general methodology to obtain an error ellipse, let’s have a look at the special case where the major axis of the ellipse is aligned with the X-axis, as shown by the following figure:

Confidence ellipse

Figure 2. Confidence ellipse for uncorrelated Gaussian data

The above figure illustrates that the angle of the ellipse is determined by the covariance of the data. In this case, the covariance is zero, such that the data is uncorrelated, resulting in an axis-aligned error ellipse.

Table 1. Covariance matrix of the data shown in Figure 2
8.4213 0
0 0.9387

Furthermore, it is clear that the magnitudes of the ellipse axes depend on the variance of the data. In our case, the largest variance is in the direction of the X-axis, whereas the smallest variance lies in the direction of the Y-axis.

In general, the equation of an axis-aligned ellipse with a major axis of length 2a and a minor axis of length 2b, centered at the origin, is defined by the following equation:

(1)   \begin{equation*} \left(\frac{ x } { a }\right)^2 + \left(\frac{ y } { b }\right)^2 = 1 \end{equation*}

In our case, the lengths of the axes are defined by the standard deviations \sigma_x and \sigma_y of the data such that the equation of the error ellipse becomes:

(2)   \begin{equation*}  \left(\frac{ x } { \sigma_x }\right)^2 + \left(\frac{ y } { \sigma_y }\right)^2 = s \end{equation*}

where s defines the scale of the ellipse and could be any arbitrary number (e.g. s=1). The question is now how to choose s, such that the scale of the resulting ellipse represents a chosen confidence level (e.g. a 95% confidence level corresponds to s=5.991).

Our 2D data is sampled from a multivariate Gaussian with zero covariance. This means that both the x-values and the y-values are normally distributed too. Assuming the data is centered at the origin, x/\sigma_x and y/\sigma_y are independent, standard normally distributed values, so the left hand side of equation (2) actually represents the sum of squares of independent standard normal samples. Such a sum is known to be distributed according to a so-called Chi-Square distribution. A Chi-Square distribution is defined in terms of ‘degrees of freedom’, which represent the number of unknowns, i.e. the number of squared terms in the sum. In our case there are two unknowns, and therefore two degrees of freedom.

Therefore, we can easily obtain the probability that the above sum, and thus s, equals a specific value by calculating the Chi-Square likelihood. In fact, since we are interested in a confidence interval, we are looking for the probability that s is less than or equal to a specific value, which can easily be obtained using the cumulative Chi-Square distribution. As statisticians are lazy people, we usually don’t try to calculate this probability, but simply look it up in a probability table: https://people.richland.edu/james/lecture/m170/tbl-chi.html.

For example, using this probability table we can easily find that, in the 2-degrees of freedom case:

    \begin{equation*} P(s < 5.991) = 1-0.05 = 0.95 \end{equation*}

Therefore, a 95% confidence interval corresponds to s=5.991. In other words, 95% of the data will fall inside the ellipse defined as:

(3)   \begin{equation*} \left(\frac{ x } { \sigma_x }\right)^2 + \left(\frac{ y } { \sigma_y }\right)^2 = 5.991 \end{equation*}

Similarly, a 99% confidence interval corresponds to s=9.210 and a 90% confidence interval corresponds to s=4.605.
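For the special case of two degrees of freedom, the table lookup can be replaced by a closed-form expression: the cumulative Chi-Square distribution is then 1 - e^{-s/2}, so that s = -2 \ln(1 - p) for a confidence level p. The Python sketch below uses this; the function name `chi2_scale_2dof` is chosen for this example only, and the formula does not hold for other degrees of freedom.

```python
import math


def chi2_scale_2dof(confidence):
    """Return s such that P(chi-square with 2 dof <= s) equals `confidence`."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Invert the 2-dof cumulative distribution 1 - exp(-s / 2).
    return -2.0 * math.log(1.0 - confidence)


scales = {p: chi2_scale_2dof(p) for p in (0.90, 0.95, 0.99)}
```

Rounded to three decimals, these values agree with the table entries 4.605, 5.991 and 9.210 quoted above.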

The error ellipse shown by figure 2 can therefore be drawn as an ellipse with a major axis length equal to 2\sigma_x \sqrt{5.991} and a minor axis length equal to 2\sigma_y \sqrt{5.991}.
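As a small worked example, the sketch below applies these two formulas to the variances of Table 1. The function name `axis_aligned_ellipse_axes` is an assumption for illustration; it simply returns the full axis lengths along the x-axis and the y-axis.

```python
import math


def axis_aligned_ellipse_axes(var_x, var_y, s=5.991):
    """Return the full axis lengths (along x, along y) of the error ellipse.

    var_x and var_y are the variances on the diagonal of the covariance
    matrix; s is the Chi-Square scale (5.991 for 95% confidence).
    """
    sigma_x = math.sqrt(var_x)
    sigma_y = math.sqrt(var_y)
    return 2.0 * sigma_x * math.sqrt(s), 2.0 * sigma_y * math.sqrt(s)


# Variances taken from Table 1.
x_length, y_length = axis_aligned_ellipse_axes(8.4213, 0.9387)
```

For the data of figure 2, the axis along x comes out roughly three times as long as the axis along y, which matches the elongated shape of the ellipse.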

Arbitrary confidence ellipses

In cases where the data is correlated, such that a non-zero covariance exists, the resulting error ellipse will not be axis-aligned. In this case, the reasoning of the previous section only holds if we temporarily define a new coordinate system such that the ellipse becomes axis-aligned, and then rotate the resulting ellipse afterwards.

In other words, whereas we calculated the variances \sigma_x^2 and \sigma_y^2 parallel to the x-axis and y-axis earlier, we now need to calculate these variances parallel to what will become the major and minor axis of the confidence ellipse. The directions in which these variances need to be calculated are illustrated by a pink and a green arrow in figure 1.

Error ellipse

Figure 1. 2D confidence ellipse for normally distributed data

These directions are actually the directions in which the data varies the most, and are defined by the covariance matrix. The covariance matrix can be considered as a matrix that linearly transformed some original data to obtain the currently observed data. In a previous article about eigenvectors and eigenvalues we showed that the direction vectors along such a linear transformation are the eigenvectors of the transformation matrix. Indeed, the vectors shown by pink and green arrows in figure 1, are the eigenvectors of the covariance matrix of the data, whereas the length of the vectors corresponds to the eigenvalues.


The eigenvalues therefore represent the spread of the data in the direction of the eigenvectors. In other words, the eigenvalues represent the variance of the data in the direction of the eigenvectors. In the case of axis aligned error ellipses, i.e. when the covariance equals zero, the eigenvalues equal the variances of the covariance matrix and the eigenvectors are equal to the definition of the x-axis and y-axis. In the case of arbitrary correlated data, the eigenvectors represent the direction of the largest spread of the data, whereas the eigenvalues define how large this spread really is.

Thus, the 95% confidence ellipse can be defined similarly to the axis-aligned case, with the major axis of length 2\sqrt{5.991 \lambda_1} and the minor axis of length 2\sqrt{5.991 \lambda_2}, where \lambda_1 and \lambda_2 represent the eigenvalues of the covariance matrix.

To obtain the orientation of the ellipse, we simply calculate the angle of the largest eigenvector towards the x-axis:

(4)   \begin{equation*} \alpha = \arctan \frac{\mathbf{v}_1(y)}{\mathbf{v}_1(x)} \end{equation*}

where \mathbf{v}_1 is the eigenvector of the covariance matrix that corresponds to the largest eigenvalue.
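The axis lengths and the angle can be computed without a linear algebra library, because the eigenvalues of a symmetric 2x2 matrix have a closed form. The following Python sketch is one possible way to do this and is not the Matlab or C++ code that accompanies this article. The function name `error_ellipse` is made up, and `math.atan2` is used instead of a plain arctangent so that a vertical eigenvector does not cause a division by zero.

```python
import math


def error_ellipse(cov_xx, cov_xy, cov_yy, s=5.991):
    """Return (major_length, minor_length, angle) of the error ellipse.

    The covariance matrix is [[cov_xx, cov_xy], [cov_xy, cov_yy]] and the
    angle (in radians) is measured from the x-axis to the major axis.
    """
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = 0.5 * (cov_xx + cov_yy)
    radius = math.hypot(0.5 * (cov_xx - cov_yy), cov_xy)
    lambda_1 = mean + radius              # largest eigenvalue
    lambda_2 = max(mean - radius, 0.0)    # smallest eigenvalue, clamped at 0

    # Eigenvector that belongs to lambda_1.
    if cov_xy != 0.0:
        v1 = (cov_xy, lambda_1 - cov_xx)
    elif cov_xx >= cov_yy:
        v1 = (1.0, 0.0)   # uncorrelated data, largest spread along x
    else:
        v1 = (0.0, 1.0)   # uncorrelated data, largest spread along y

    angle = math.atan2(v1[1], v1[0])      # equation (4)
    return 2.0 * math.sqrt(s * lambda_1), 2.0 * math.sqrt(s * lambda_2), angle


major, minor, angle = error_ellipse(2.0, 1.0, 2.0)
```

In this example the eigenvalues are 3 and 1 and the major axis points along the diagonal, so the angle is 45 degrees.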

Based on the minor and major axis lengths and the angle \alpha between the major axis and the x-axis, it becomes trivial to plot the confidence ellipse. Figure 3 shows error ellipses for several confidence values:

Error ellipses

Figure 3. Confidence ellipses for normally distributed data
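To actually plot the ellipse, we can sample points on an axis-aligned ellipse and rotate them by the angle \alpha. The Python sketch below only generates the outline points and leaves the drawing to a plotting library of your choice; the function name `ellipse_points` and its parameters are assumptions for this example.

```python
import math


def ellipse_points(center, major_length, minor_length, angle, n_points=100):
    """Return n_points (x, y) tuples on the outline of a rotated ellipse."""
    a = 0.5 * major_length   # semi-major axis
    b = 0.5 * minor_length   # semi-minor axis
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    points = []
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        # Point on the axis-aligned ellipse centered at the origin.
        x, y = a * math.cos(t), b * math.sin(t)
        # Rotate by the ellipse angle and shift to the center.
        points.append((center[0] + x * cos_a - y * sin_a,
                       center[1] + x * sin_a + y * cos_a))
    return points


outline = ellipse_points((0.0, 0.0), 4.0, 2.0, math.pi / 2, n_points=4)
```

The center would normally be the mean of the data, and the axis lengths and angle are the ones derived from the eigenvalues and the largest eigenvector.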

Source Code

Matlab source code
C++ source code (uses OpenCV)

Conclusion

In this article we showed how to obtain the error ellipse for 2D normally distributed data, according to a chosen confidence value. This is often useful when visualizing or analyzing data and will be of interest in a future article about PCA.

Furthermore, source code samples were provided for Matlab and C++.

If you’re new to this blog, don’t forget to subscribe, or follow me on twitter!



]]>
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/feed/ 62
Why divide the sample variance by N-1? https://www.visiondummy.com/2014/03/divide-variance-n-1/ https://www.visiondummy.com/2014/03/divide-variance-n-1/#comments Fri, 07 Mar 2014 15:29:22 +0000 http://www.visiondummy.com/?p=196 In this article, we will derive the well known formulas for calculating the mean and the variance of normally distributed data, in order to answer the question in the article’s title. However, for readers who are not interested in the ‘why’ of this question but only in the ‘when’, the answer is quite simple: If [...]

The post Why divide the sample variance by N-1? appeared first on Computer vision for dummies.

]]>
Introduction

In this article, we will derive the well known formulas for calculating the mean and the variance of normally distributed data, in order to answer the question in the article’s title. However, for readers who are not interested in the ‘why’ of this question but only in the ‘when’, the answer is quite simple:

If you have to estimate both the mean and the variance of the data (which is typically the case), then divide by N-1, such that the variance is obtained as:

    \begin{equation*} \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2 \end{equation*}

If, on the other hand, the mean of the true population is known such that only the variance needs to be estimated, then divide by N, such that the variance is obtained as:

    \begin{equation*} \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \end{equation*}

Whereas the former is what you will typically need, an example of the latter would be the estimation of the spread of white Gaussian noise. Since the mean of white Gaussian noise is known to be zero, only the variance needs to be estimated in this case.

If data is normally distributed we can completely characterize it by its mean \mu and its variance \sigma^2. The variance is the square of the standard deviation \sigma which represents the average deviation of each data point to the mean. In other words, the variance represents the spread of the data. For normally distributed data, 68.3% of the observations will have a value between \mu-\sigma and \mu+\sigma. This is illustrated by the following figure which shows a Gaussian density function with mean \mu=10 and variance \sigma^2 = 3^2 = 9:

Gaussian density

Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall within the interval defined by the mean plus and minus the standard deviation.

Usually we do not have access to the complete population of the data. In the above example, we would typically have a few observations at our disposal but we do not have access to all possible observations that define the x-axis of the plot. For example, we might have the following set of observations:

Table 1
Observation ID Observed Value
Observation 1 10
Observation 2 12
Observation 3 7
Observation 4 5
Observation 5 11

If we now calculate the empirical mean by summing up all values and dividing by the number of observations, we have:

(1)   \begin{equation*} \mu = \frac{10+12+7+5+11}{5} = 9. \end{equation*}

Usually we assume that the empirical mean is close to the actually unknown mean of the distribution, and thus assume that the observed data is sampled from a Gaussian distribution with mean \mu=9. In this example, the actual mean of the distribution is 10, so the empirical mean indeed is close to the actual mean.

The variance of the data is calculated as follows:

(2)   \begin{equation*} \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2 = \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4} = 8.5. \end{equation*}

Again, we usually assume that this empirical variance is close to the real and unknown variance of the underlying distribution. In this example, the real variance was 9, so indeed the empirical variance is close to the real variance.

The question at hand is now why the formulas used to calculate the empirical mean and the empirical variance are correct. In fact, another often used formula to calculate the variance, is defined as follows:

(3)   \begin{equation*} \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 = \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{5} = 6.8. \end{equation*}

The only difference between equation (2) and (3) is that the former divides by N-1, whereas the latter divides by N. Both formulas are actually correct, but when to use which one depends on the situation.
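Both formulas are available in Python's standard `statistics` module, where `variance` divides by N-1 and `pvariance` divides by N. The short sketch below reproduces the numbers of equations (1), (2) and (3) for the observations of Table 1; the variable names are chosen for this example.

```python
import statistics

observations = [10, 12, 7, 5, 11]   # Table 1

empirical_mean = statistics.mean(observations)

# Sum of squared deviations from the empirical mean.
sum_of_squares = sum((x - empirical_mean) ** 2 for x in observations)

variance_n_minus_1 = statistics.variance(observations)   # divides by N-1
variance_n = statistics.pvariance(observations)          # divides by N
```

The two results differ only in the denominator that is applied to the same sum of squared deviations.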

In the following sections, we will completely derive the formulas that best approximate the unknown variance and mean of a normal distribution, given a few samples from this distribution. We will show in which cases to divide the variance by N and in which cases to normalize by N-1.

A formula that approximates a parameter (mean or variance) is called an estimator. In the following, we will denote the real and unknown parameters of the distribution by \hat{\mu} and \hat{\sigma}^2. The estimators, e.g. the empirical average and empirical variance, are denoted as \mu and \sigma^2.

To find the optimal estimators, we first need an analytical expression for the likelihood of observing a specific data point x_i, given the fact that the population is normally distributed with a given mean \mu and standard deviation \sigma. A normal distribution with known parameters is usually denoted as N(\mu, \sigma^2). The likelihood function is then:

(4)   \begin{align*} &x_i \sim N(\mu, \sigma^2) \\ &\Rightarrow P(x_i; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2}. \end{align*}

To calculate the mean and variance, we obviously need more than one sample from this distribution. In the following, let vector \vec{x}=(x_1, x_2, ..., x_N) be a vector that contains all the available samples (e.g. all the values from the example in table 1). If all these samples are statistically independent, we can write their joint likelihood function as the product of all individual likelihoods:

(5)   \begin{equation*} P(\vec{x}; \mu, \sigma^2) = P(x_1, x_2, ..., x_N; \mu, \sigma^2) = P(x_1; \mu, \sigma^2)P(x_2; \mu, \sigma^2)...P(x_N; \mu, \sigma^2) = \prod_{i=1}^N P(x_i; \mu, \sigma^2) \end{equation*}

Plugging equation (4) into equation (5) then yields an analytical expression for this joint probability density function:

(6)   \begin{align*} P(\vec{x}; \mu, \sigma^2) &= \prod_{i=1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2}\\ &= \frac{1}{(2 \pi \sigma^2)^{\frac{N}{2}}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i - \mu)^2} \end{align*}

Equation (6) will be important in the next sections and will be used to derive the well known expressions for the estimators of the mean and the variance of a Gaussian distribution.
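Equation (6) translates almost directly into code. The Python sketch below evaluates its logarithm first and exponentiates afterwards, which is an implementation choice made here to keep the computation numerically well behaved for many samples; the function names are assumptions for this example.

```python
import math


def gaussian_log_likelihood(samples, mu, sigma2):
    """Logarithm of the joint likelihood of equation (6)."""
    n = len(samples)
    squared_error = sum((x - mu) ** 2 for x in samples)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - squared_error / (2.0 * sigma2))


def gaussian_likelihood(samples, mu, sigma2):
    """Joint likelihood of equation (6) for independent samples."""
    return math.exp(gaussian_log_likelihood(samples, mu, sigma2))


# Likelihood of the observations of Table 1 under N(10, 9).
table_1_likelihood = gaussian_likelihood([10, 12, 7, 5, 11], 10.0, 9.0)
```

Multiplying the five individual likelihoods of equation (4) gives the same value, as expected from equation (5).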

Minimum variance, unbiased estimators

To determine if an estimator is a ‘good’ estimator, we first need to define what a ‘good’ estimator really is. The goodness of an estimator depends on two measures, namely its bias and its variance (yes, we will talk about the variance of the mean-estimator and the variance of the variance-estimator). Both measures are briefly discussed in this section.

Parameter bias

Imagine that we could obtain different (disjoint) subsets of the complete population. In analogy to our previous example, imagine that, apart from the data in Table 1, we also have a Table 2 and a Table 3 with different observations. Then a good estimator for the mean, would be an estimator that on average would be equal to the real mean. Although we can live with the idea that the empirical mean from one subset of data is not equal to the real mean like in our example, a good estimator should make sure that the average of the estimated means from all subsets is equal to the real mean. This constraint is expressed mathematically by stating that the Expected Value of the estimator should equal the real parameter value:

(7)   \begin{align*}  &E[\mu] = \hat{\mu}\\ &E[\sigma^2] = \hat{\sigma^2} \end{align*}

If the above conditions hold, then the estimators are called ‘unbiased estimators’. If the conditions do not hold, the estimators are said to be ‘biased’, since on average they will either underestimate or overestimate the true value of the parameter.

Parameter variance

Unbiased estimators guarantee that on average they yield an estimate that equals the real parameter. However, this does not mean that each estimate is a good estimate. For instance, if the real mean is 10, an unbiased estimator could estimate the mean as 50 on one population subset and as -30 on another subset. The expected value of the estimate would then indeed be 10, which equals the real parameter, but the quality of the estimator clearly also depends on the spread of each estimate. An estimator that yields the estimates (10, 15, 5, 12, 8) for five different subsets of the population is unbiased just like an estimator that yields the estimates (50, -30, 100, -90, 10). However, all estimates from the first estimator are closer to the true value than those from the second estimator.

Therefore, a good estimator not only has a low bias, but also yields a low variance. This variance is expressed as the mean squared error of the estimator:

    \begin{align*} &Var(\mu) = E[(\hat{\mu} - \mu)^2]\\ &Var(\sigma^2) = E[(\hat{\sigma}^2 - \sigma^2)^2] \end{align*}

A good estimator is therefore a low bias, low variance estimator. The optimal estimator, if such an estimator exists, is then the one that has no bias and a variance that is lower than any other possible estimator. Such an estimator is called the minimum variance, unbiased (MVU) estimator. In the next section, we will derive the analytical expressions for the mean and the variance estimators of a Gaussian distribution. We will show that the MVU estimator for the variance of a normal distribution requires us to divide the variance by N under certain assumptions, and requires us to divide by N-1 if these assumptions do not hold.

Maximum Likelihood estimation

Although numerous techniques can be used to obtain an estimator of the parameters based on a subset of the population data, the simplest of all is probably the maximum likelihood approach.

The probability of observing \vec{x} was defined by equation (6) as P(\vec{x}; \mu, \sigma^2). If we fix \mu and \sigma^2 in this function, while letting \vec{x} vary, we obtain the Gaussian distribution as plotted by figure 1. However, we could also choose a fixed \vec{x} and let \mu and/or \sigma^2 vary. For example, we can choose \vec{x}=(10, 12, 7, 5, 11) like in our previous example. We also choose a fixed \mu=10, and we let \sigma^2 vary. Figure 2 shows the likelihood of the proposed fixed \vec{x} and \mu for each different value of \sigma^2:

Maximum likelihood parameter estimation

Figure 2. This plot shows the likelihood of observing fixed data \vec{x} if the data is normally distributed with a chosen, fixed \mu=10, plotted against various values of a varying \sigma^2.

In the above figure, we calculated the likelihood P(\vec{x};\sigma^2) by varying \sigma^2 for a fixed \mu=10. Each point in the resulting curve represents the likelihood that observation \vec{x} is a sample from a Gaussian distribution with parameter \sigma^2. The parameter value that corresponds to the highest likelihood is then most likely the parameter that defines the distribution our data originated from. Therefore, we can determine the optimal \sigma^2 by finding the maximum in this likelihood curve. In this example, the maximum is at \sigma^2 = 7.8, such that the standard deviation is \sqrt{(\sigma^2)} = 2.8. Indeed if we would calculate the variance in the traditional way, with a given \mu=10, we would find that it is equal to 7.8:

    \begin{equation*} \frac{(10-10)^2+(12-10)^2+(7-10)^2+(5-10)^2+(11-10)^2}{5} = 7.8. \end{equation*}

Therefore, the formula to compute the variance based on the sample data is simply derived by finding the peak of the likelihood function. Furthermore, instead of fixing \mu, we can let both \mu and \sigma^2 vary at the same time. Finding both estimators then corresponds to finding the maximum in a two-dimensional likelihood function.
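The peak for the fixed \mu=10 example can also be located numerically with a simple grid search. The Python sketch below evaluates the logarithm of the likelihood, which has its maximum at the same position, on a grid of candidate variances; the grid spacing of 0.1 and the variable names are arbitrary choices for this example.

```python
import math

samples = [10, 12, 7, 5, 11]
mu = 10.0   # fixed mean, as in the example above


def log_likelihood(sigma2):
    """Log of the joint likelihood of `samples` for the fixed mean `mu`."""
    n = len(samples)
    squared_error = sum((x - mu) ** 2 for x in samples)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - squared_error / (2.0 * sigma2))


# Candidate variances 0.1, 0.2, ..., 30.0.
candidates = [k / 10.0 for k in range(1, 301)]
best_sigma2 = max(candidates, key=log_likelihood)
```

The grid search is only meant to make the idea tangible; the sections that follow derive the position of the maximum analytically.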

To find the maximum of a function, we simply set its derivative to zero. If we want to find the maximum of a function with two variables, we calculate the partial derivative towards each of these variables and set both to zero. In the following, let \hat{\mu}_{ML} be the optimal estimator for the population mean as obtained using the maximum likelihood method, and let \hat{\sigma}^2_{ML} be the optimal estimator for the variance. To maximize the likelihood function we simply calculate its (partial) derivatives and set them to zero as follows:

    \begin{align*} &\hat{\mu}_{ML} = \arg\max_\mu P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0 \end{align*}

and

    \begin{align*} &\hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0 \end{align*}

In the following paragraphs we will use this technique to obtain the MVU estimators of both \hat{\mu} and \hat{\sigma}^2. We consider two cases:

The first case assumes that the true mean of the distribution \hat{\mu} is known. Therefore, we only need to estimate the variance and the problem then corresponds to finding the maximum in a one-dimensional likelihood function, parameterized by \sigma^2. Although this situation does not occur often in practice, it definitely has practical applications. For instance, if we know that a signal (e.g. the color value of a pixel in an image) should have a specific value, but the signal has been polluted by white noise (Gaussian noise with zero mean), then the mean of the distribution is known and we only need to estimate the variance.

The second case deals with the situation where both the true mean and the true variance are unknown. This is the case you would encounter most and where you would obtain an estimate of the mean and the variance based on your sample data.

In the next paragraphs we will show that each case results in a different MVU estimator. More specifically, the first case requires the variance estimator to be normalized by N to be MVU, whereas the second case requires division by N-1 to be MVU.

Estimating the variance if the mean is known

Parameter estimation

If the true mean of the distribution is known, then the likelihood function is only parameterized on \sigma^2. Obtaining the maximum likelihood estimator then corresponds to solving:

(8)   \begin{equation*} \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \sigma^2) \end{equation*}

However, calculating the derivative of P(\vec{x}; \sigma^2), defined by equation (6), is rather involved due to the exponent in the function. In fact, it is much easier to maximize the log-likelihood function instead of the likelihood function itself. Since the logarithm is a monotonically increasing function, the maximum will be at the same location. Therefore, we solve the following problem instead:

(9)   \begin{equation*} \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2}\log(P(\vec{x}; \sigma^2)). \end{equation*}

In the following we set s=\sigma^2 to obtain a simpler notation. To find the maximum of the log-likelihood function, we simply calculate the derivative of the logarithm of equation (6) and set it to zero:

    \begin{align*} &\frac{\partial \log(P(\vec{x}; \sigma^2))}{\partial \sigma^2} = 0\\ &\Leftrightarrow \frac{\partial \log(P(\vec{x}; s))}{\partial s} = 0\\ &\Leftrightarrow \frac{\partial}{\partial s} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial s} \log \left( \frac{1}{(2 \pi)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial s} \log \left( \frac{1}{s^{\frac{N}{2}}} \right) + \frac{\partial}{\partial s} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial s} \log \left( (s)^{-\frac{N}{2}} \right) + \frac{\partial}{\partial s} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{N}{2} \frac{\partial}{\partial s} \log \left( s \right) - \frac{1}{2} \sum_{i=1}^N(x_i - \mu)^2 \frac{\partial}{\partial s} \left(\frac{1}{s}\right) = 0\\ &\Leftrightarrow -\frac{N}{2s} + \frac{1}{2} \sum_{i=1}^N(x_i - \mu)^2 \left(\frac{1}{s^2}\right) = 0\\ &\Leftrightarrow \frac{N}{2s^2} \left (-s + \frac{1}{N} \sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow \frac{N}{2s^2} \left (\frac{1}{N} \sum_{i=1}^N(x_i - \mu)^2 - s \right) = 0 \end{align*}

It is clear that if N > 0, then the only possible solution to the above is:

(10)   \begin{equation*} s = \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2. \end{equation*}

Note that this maximum likelihood estimator for \hat{\sigma}^2 is indeed the traditional formula to calculate the variance of normal data. The normalization factor is \frac{1}{N}.

However, the maximum likelihood method does not guarantee to deliver an unbiased estimator. On the other hand, if the obtained estimator is unbiased, then the maximum likelihood method does guarantee that the estimator is also minimum variance and thus MVU. Therefore, we need to check if the estimator in equation (10) is unbiased.

Performance evaluation

To check if the estimator defined by equation (10) is unbiased, we need to check if the condition of equation (7) holds, and thus if

    \begin{equation*} E[s] = \hat{s}. \end{equation*}

To do this, we plug equation (10) into E[s] and write:

    \begin{align*} E[s] &= E \left[\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[x_i^2 - 2x_i \mu + \mu^2 \right]\\ &= \frac{1}{N} \left( N E[x_i^2] -2N \mu E[x_i] + N \mu^2 \right) \\ &= \frac{1}{N} \left( N E[x_i^2] -2N \mu^2 + N \mu^2 \right) \\ &= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right) \\ \end{align*}

Furthermore, an important property of variance is that the true variance \hat{s} can be written as \hat{s} = E[x_i^2] - E[x_i]^2 such that E[x_i^2] = \hat{s} + E[x_i]^2 = \hat{s} + \mu^2. Using this property in the above equation yields:

    \begin{align*} E[s] &= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right) \\ &= \frac{1}{N} \left( N \hat{s} + N \mu^2 -N \mu^2 \right)\\ &= \frac{1}{N} \left( N \hat{s} \right)\\ &= \hat{s} \end{align*}

Since E[s]=\hat{s}, the condition shown by equation (7) holds, and therefore the obtained estimator for the variance \hat{s} of the data is unbiased. Furthermore, because the maximum likelihood method guarantees that an unbiased estimator is also minimum variance (MVU), this means that no other estimator exists that can do better than the one obtained here.
Therefore, we have to divide by N instead of N-1 while calculating the variance of normally distributed data, if the true mean of the underlying distribution is known.
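This result can be checked empirically with a small simulation. The Python sketch below repeatedly draws five samples from N(10, 9), applies equation (10) with the known true mean, and averages the estimates over many trials. The function names, the number of trials and the random seed are arbitrary choices for this example.

```python
import random


def known_mean_variance(samples, true_mean):
    """Equation (10): divide by N when the true mean is known."""
    return sum((x - true_mean) ** 2 for x in samples) / len(samples)


def average_estimate(n_samples=5, n_trials=20000,
                     true_mean=10.0, true_sigma=3.0, seed=0):
    """Average the variance estimate over many independent sample sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        samples = [rng.gauss(true_mean, true_sigma) for _ in range(n_samples)]
        total += known_mean_variance(samples, true_mean)
    return total / n_trials


avg_known_mean = average_estimate()
```

The average should land close to the true variance of 9, which is what unbiasedness predicts, although any single estimate can still be far off.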

Estimating the variance if the mean is unknown

Parameter estimation

In the previous section, the true mean of the distribution was known, such that we only had to find an estimator for the variance of the data. However, if the true mean is not known, then an estimator has to be found for the mean too. Furthermore, this mean estimate is used by the variance estimator. As a result, we will show that the earlier obtained estimator for the variance is no longer unbiased. Furthermore, we will show that we can ‘unbias’ the estimator in this case by dividing by N-1 instead of by N, which slightly increases the variance of the estimator.

As before, we use the maximum likelihood method to obtain the estimators based on the log-likelihood function. We first find the ML estimator for \hat{\mu}:

    \begin{align*} &\frac{\partial \log(P(\vec{x}; s, \mu))}{\partial \mu} = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial \mu} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{1}{2s}\frac{\partial}{\partial \mu} \left(\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{1}{2s} \left(\sum_{i=1}^N -2(x_i - \mu) \right) = 0\\ &\Leftrightarrow \frac{1}{s} \left(\sum_{i=1}^N (x_i - \mu) \right) = 0\\ &\Leftrightarrow \frac{N}{s} \left( \frac{1}{N} \sum_{i=1}^N (x_i) - \mu \right) = 0 \end{align*}

If N>0, then it is clear that the above equation only has a solution if:

(11)   \begin{equation*} \mu = \frac{1}{N} \sum_{i=1}^N (x_i). \end{equation*}

Note that indeed this is the well known formula to calculate the mean of a distribution. Although we all knew this formula, we now proved that it is the maximum likelihood estimator for the true and unknown mean \hat{\mu} of a normal distribution. For now, we will just assume that the estimator that we found earlier for the variance \hat{s}, defined by equation (10), is still the MVU variance estimator. In the next section however, we will show that this estimator is no longer unbiased now.

Performance evaluation

To check if the estimator \mu for the true mean \hat{\mu} is unbiased, we have to make sure that the condition of equation (7) holds:

    \begin{equation*} E[\mu] = E \left[\frac{1}{N} \sum_{i=1}^N (x_i) \right] = \frac{1}{N}\sum_{i=1}^N E[x_i] = \frac{1}{N} N E[x_i] = \frac{1}{N} N \hat{\mu} = \hat{\mu}.  \end{equation*}

Since E[\mu] = \hat{\mu}, this means that the obtained estimator for the mean of the distribution is unbiased. Since the maximum likelihood method guarantees to deliver the minimum variance estimator if the estimator is unbiased, we proved that \mu is the MVU estimator of the mean.

To check if the earlier found estimator s for the variance \hat{s} is still unbiased if it is based on the empirical mean \mu instead of the true mean \hat{\mu}, we simply plug the obtained estimator \mu into the earlier derived estimator s of equation (10):

    \begin{align*} s &= \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ &=\frac{1}{N}\sum_{i=1}^N \left(x_i - \frac{1}{N} \sum_{j=1}^N x_j \right)^2\\ &=\frac{1}{N}\sum_{i=1}^N \left[x_i^2 - 2 x_i \frac{1}{N} \sum_{j=1}^N x_j + \left(\frac{1}{N} \sum_{j=1}^N x_j \right)^2 \right]\\ &=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{j=1}^N x_j}{N^2} + \left(\frac{\sum_{j=1}^N x_j}{N} \right)^2\\ &=\frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \end{align*}

To check if the estimator is still unbiased, we now need to check again if the condition of equation (7) holds:

    \begin{align*} E[s] &= E \left[ \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \right ] \\ & = \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ \end{align*}

As mentioned in the previous section, an important property of variance is that the true variance \hat{s} can be written as \hat{s} = E[x_i^2] - E[x_i]^2 such that E[x_i^2] = \hat{s} + E[x_i]^2 = \hat{s} + \hat{\mu}^2. Using this property in the above equation, together with the fact that the samples are independent such that E[x_i x_j] = E[x_i] E[x_j] for i \neq j, yields:

    \begin{align*} E[s] &= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{E[\sum_{i=1}^N x_i^2 + \sum_{i=1}^N \sum_{j\neq i}^N x_i x_j]}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{N(\hat{s}+\hat{\mu}^2) + E[\sum_{i=1}^N \sum_{j\neq i}^N x_i x_j]}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{N(\hat{s}+\hat{\mu}^2) + \sum_{i=1}^N \sum_{j\neq i}^N E[x_i] E[x_j]}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{N(\hat{s}+\hat{\mu}^2) + N(N-1)\hat{\mu}^2}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{N(\hat{s}+\hat{\mu}^2) + N^2\hat{\mu}^2 -N\hat{\mu}^2}{N^2} \\ &= \hat{s} + \hat{\mu}^2 - \frac{\hat{s}+\hat{\mu}^2 + N\hat{\mu}^2 -\hat{\mu}^2}{N} \\ &= \hat{s} + \hat{\mu}^2 - \frac{\hat{s}}{N} - \frac{\hat{\mu}^2}{N} - \hat{\mu}^2 + \frac{\hat{\mu}^2}{N}\\ &= \hat{s} - \frac{\hat{s}}{N}\\ &= \hat{s} \left( 1 - \frac{1}{N} \right)\\ &= \hat{s} \left(\frac{N-1}{N} \right) \end{align*}

Since clearly E[s] \neq \hat{s}, this shows that the estimator for the variance of the distribution is no longer unbiased. In fact, this estimator on average underestimates the true variance by a factor \frac{N-1}{N}. As the number of samples approaches infinity (N \rightarrow \infty), this bias converges to zero. For small sample sets however, the bias is significant and should be eliminated.
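The factor \frac{N-1}{N} can be reproduced exactly on a small example. The sketch below is an illustration under stated assumptions, not part of the derivation: it uses a made-up discrete population instead of a normal distribution (the result only relies on the samples being independent and identically distributed), and the helper name `biased_variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy population: each value is drawn with equal probability.
population = [1, 2, 6]
N = 3  # sample size (illustrative choice)

def biased_variance(sample):
    """Variance estimate that divides by N and uses the sample mean."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / n

# True mean and true variance of the population, as exact fractions.
true_mean = Fraction(sum(population), len(population))
true_variance = sum((x - true_mean) ** 2 for x in population) / len(population)

# Exact expectation of the estimator: average over all equally likely
# samples of size N drawn with replacement.
samples = list(product(population, repeat=N))
expected_s = sum(biased_variance(s) for s in samples) / len(samples)

print("true variance:        ", true_variance)
print("E[s]:                 ", expected_s)
print("(N-1)/N * true var.:  ", Fraction(N - 1, N) * true_variance)
```

The last two printed values are identical, so on average the estimator that divides by N recovers only the fraction \frac{N-1}{N} of the true variance.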

Fixing the bias

Since the bias is merely a factor, we can eliminate it by scaling the biased estimator s defined by equation (10) by the inverse of the bias. We therefore define a new, unbiased estimate s\prime as follows:

    \begin{align*} s\prime &= \left ( \frac{N-1}{N} \right )^{-1} s\\ s\prime &= \left ( \frac{N-1}{N} \right )^{-1} \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ s\prime &= \left ( \frac{N}{N-1} \right ) \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ s\prime &= \frac{1}{N-1}\sum_{i=1}^N(x_i - \mu)^2\\ \end{align*}

This estimator is now unbiased and indeed resembles the traditional formula to calculate the variance, where we divide by N-1 instead of N. However, note that the resulting estimator is no longer the minimum variance estimator, but it is the estimator with the minimum variance amongst all unbiased estimators. If we divide by N, the estimator is biased, and if we divide by N-1, the estimator is not the minimum variance estimator. In general, however, having a biased estimator is much worse than having an estimator with a slightly higher variance. Therefore, if the mean of the population is unknown, division by N-1 should be used instead of division by N.
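In practice, both normalizations are usually available out of the box. As a small illustration, Python's standard `statistics` module offers `pvariance` (division by N) and `variance` (division by N-1); the data values below are made up for the example.

```python
import statistics

# Illustrative data only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)

# Division by N: appropriate when the true mean is known, or when the
# data represents the entire population.
var_n = statistics.pvariance(data)

# Division by N-1: the bias-corrected estimator for a sample whose mean
# had to be estimated from the same data.
var_n_minus_1 = statistics.variance(data)

print("divide by N:   ", var_n)
print("divide by N-1: ", var_n_minus_1)

# The two estimates differ exactly by the factor N / (N-1).
print("ratio:         ", var_n_minus_1 / var_n, "vs", N / (N - 1))
```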

Conclusion

In this article, we showed where the usual formulas for calculating the mean and the variance of normally distributed data come from. Furthermore, we have proven that the normalization factor in the variance estimator formula should be \frac{1}{N} if the true mean of the population is known, and should be \frac{1}{N-1} if the mean itself also has to be estimated.


The post Why divide the sample variance by N-1? appeared first on Computer vision for dummies.

]]>
https://www.visiondummy.com/2014/03/divide-variance-n-1/feed/ 18
What are eigenvectors and eigenvalues? https://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/ https://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/#comments Wed, 05 Mar 2014 14:44:53 +0000 http://www.visiondummy.com/?p=111 Eigenvectors and eigenvalues have many important applications in computer vision and machine learning in general. Well known examples are PCA (Principal Component Analysis) for dimensionality reduction or EigenFaces for face recognition. An interesting use of eigenvectors and eigenvalues is also illustrated in my post about error ellipses. Furthermore, eigendecomposition forms the base of the geometric [...]

The post What are eigenvectors and eigenvalues? appeared first on Computer vision for dummies.

]]>
Introduction

Eigenvectors and eigenvalues have many important applications in computer vision and machine learning in general. Well known examples are PCA (Principal Component Analysis) for dimensionality reduction or EigenFaces for face recognition. An interesting use of eigenvectors and eigenvalues is also illustrated in my post about error ellipses. Furthermore, eigendecomposition forms the base of the geometric interpretation of covariance matrices, discussed in a more recent post. In this article, I will provide a gentle introduction to this mathematical concept, and will show how to manually obtain the eigendecomposition of a 2 \times 2 matrix.

An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. Consider the image below in which three vectors are shown. The green square is only drawn to illustrate the linear transformation that is applied to each of these three vectors.


Eigenvectors (red) do not change direction when a linear transformation (e.g. scaling) is applied to them. Other vectors (yellow) do.

The transformation in this case is a simple scaling with factor 2 in the horizontal direction and factor 0.5 in the vertical direction, such that the transformation matrix A is defined as:

A=\begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}.

A vector \vec{v}=(x,y) is then scaled by applying this transformation as \vec{v}\prime = A\vec{v}. The above figure shows that the direction of some vectors (shown in red) is not affected by this linear transformation. These vectors are called eigenvectors of the transformation and, together with their eigenvalues, characterize the square matrix A. This close relation to the matrix is exactly the reason that those vectors are called ‘eigenvectors’ (‘eigen’ means ‘own’ or ‘characteristic’ in German).

In general, the eigenvector \vec{v} of a matrix A is the vector for which the following holds:

(1)   \begin{equation*} A \vec{v} = \lambda \vec{v} \end{equation*}

where \lambda is a scalar value called the ‘eigenvalue’. This means that the linear transformation A on vector \vec{v} is completely defined by \lambda.
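To make equation (1) concrete, the sketch below applies the scaling matrix from the figure to three vectors and checks which of them keep their direction. The helper names `mat_vec` and `is_parallel` are hypothetical and only written for this illustration.

```python
def mat_vec(A, v):
    """Multiply a 2x2 matrix (given as a list of rows) by a 2D vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def is_parallel(u, v, tol=1e-12):
    """True if u and v lie on the same line through the origin.

    The 2D cross product u_x*v_y - u_y*v_x is zero exactly when one
    vector is a scalar multiple of the other.
    """
    return abs(u[0] * v[1] - u[1] * v[0]) <= tol

# Scaling by 2 horizontally and by 0.5 vertically.
A = [[2, 0], [0, 0.5]]

horizontal = [1, 0]
vertical = [0, 1]
diagonal = [1, 1]

for name, v in [("horizontal", horizontal),
                ("vertical", vertical),
                ("diagonal", diagonal)]:
    Av = mat_vec(A, v)
    print(name, v, "->", Av, "eigenvector:", is_parallel(v, Av))
```

The horizontal and vertical vectors are only stretched or shrunk (with eigenvalues 2 and 0.5), whereas the diagonal vector is rotated towards the horizontal axis and is therefore not an eigenvector.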

We can rewrite equation (1) as follows:

(2)   \begin{eqnarray*} A \vec{v} - \lambda \vec{v} = 0 \\  \Rightarrow (A - \lambda I) \vec{v} = 0, \end{eqnarray*}

where I is the identity matrix of the same dimensions as A.

However, assuming that \vec{v} is not the null-vector, equation (2) can only hold if (A - \lambda I) is not invertible. If a square matrix is not invertible, its determinant must equal zero. Therefore, to find the eigenvalues of A, and from them the eigenvectors, we simply have to solve the following equation:

(3)   \begin{equation*}  Det(A - \lambda I) = 0. \end{equation*}

In the following sections we will determine the eigenvectors and eigenvalues of a matrix A, by solving equation (3). Matrix A in this example, is defined by:

(4)   \begin{equation*} A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}. \end{equation*}

Calculating the eigenvalues

To determine the eigenvalues for this example, we substitute A in equation (3) by equation (4) and obtain:

(5)   \begin{equation*} Det\begin{pmatrix}2-\lambda&3\\2&1-\lambda\end{pmatrix}=0. \end{equation*}

Calculating the determinant gives:

(6)   \begin{align*} &(2-\lambda)(1-\lambda) - 6 = 0\\ \Rightarrow &2 - 2 \lambda - \lambda + \lambda^2 -6 = 0\\ \Rightarrow &{\lambda}^2 - 3 \lambda -4 = 0. \end{align*}

To solve this quadratic equation in \lambda, we find the discriminant:

    \begin{equation*} D = b^2 -4ac = (-3)^2 -4*1*(-4) = 9+16 = 25. \end{equation*}

Since the discriminant is strictly positive, this means that two different values for \lambda exist:

(7)   \begin{align*}  \lambda _1 &= \frac{-b - \sqrt{D}}{2a} = \frac{3-5}{2} = -1,\\ \lambda _2 &= \frac{-b + \sqrt{D}}{2a} = \frac{3+5}{2} = 4. \end{align*}

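We have now determined the two eigenvalues \lambda_1 and \lambda_2. Note that a square matrix of size N \times N always has exactly N eigenvalues when they are counted with multiplicity and complex values are allowed, and each distinct eigenvalue has at least one corresponding eigenvector. The eigenvalue specifies the factor by which its eigenvector is scaled by the transformation.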
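The same steps can be written down for any 2 \times 2 matrix, since its characteristic polynomial is always \lambda^2 - (a+d)\lambda + (ad - bc) for a matrix with rows (a, b) and (c, d). The function below is a minimal sketch of that calculation; the name `eigenvalues_2x2` is hypothetical, and the function only handles the case of real eigenvalues.

```python
import math

def eigenvalues_2x2(A):
    """Real eigenvalues of a 2x2 matrix, found by solving det(A - lambda*I) = 0.

    The characteristic polynomial is lambda^2 - trace*lambda + det = 0,
    which is solved with the quadratic formula. Raises ValueError if the
    discriminant is negative, i.e. if the eigenvalues are complex.
    """
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    discriminant = trace ** 2 - 4 * det
    if discriminant < 0:
        raise ValueError("matrix has complex eigenvalues")
    root = math.sqrt(discriminant)
    return (trace - root) / 2, (trace + root) / 2

# Matrix A from equation (4).
lambda_1, lambda_2 = eigenvalues_2x2([[2, 3], [2, 1]])
print(lambda_1, lambda_2)
```

For the matrix of equation (4), the trace is 3 and the determinant is -4, so the discriminant is 25 as computed above.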

Calculating the first eigenvector

We can now determine the eigenvectors by plugging the eigenvalues from equation (7) into equation (1) that originally defined the problem. The eigenvectors are then found by solving this system of equations.

We first do this for eigenvalue \lambda_1, in order to find the corresponding first eigenvector:

    \begin{equation*} \begin{bmatrix}2&3\\2&1\end{bmatrix} \begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = -1 \begin{bmatrix}x_{11}\\x_{12}\end{bmatrix}. \end{equation*}

Since this is simply the matrix notation for a system of equations, we can write it in its equivalent form:

(8)   \begin{eqnarray*} \left\{ \begin{array}{lr} 2x_{11} + 3x_{12} = -x_{11}\\ 2x_{11} + x_{12} = -x_{12} \end{array} \right. \end{eqnarray*}

and solve the first equation as a function of x_{12}, resulting in:

(9)   \begin{equation*}  x_{11} = -x_{12}. \end{equation*}

Since an eigenvector simply represents an orientation (the corresponding eigenvalue represents the scaling factor along it), all non-zero scalar multiples of the eigenvector are parallel to it and are therefore equivalent (if we normalized the vectors, they would all be equal up to sign). Thus, instead of further solving the above system of equations, we can freely choose a non-zero real value for either x_{11} or x_{12}, and determine the other one by using equation (9).

For this example, we arbitrarily choose x_{12} = 1, such that x_{11}=-1. Therefore, the eigenvector that corresponds to eigenvalue \lambda_1 = -1 is

(10)   \begin{equation*} \vec{v}_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \end{equation*}
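That the choice of x_{12} is indeed arbitrary can be checked by plugging several scalar multiples of \vec{v}_1 back into equation (1). The following is a small sketch with hypothetical helper names (`mat_vec`, `satisfies_eigen_equation`); the scaling factors are chosen only for illustration.

```python
def mat_vec(A, v):
    """Multiply a 2x2 matrix (given as a list of rows) by a 2D vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def satisfies_eigen_equation(A, v, lam):
    """Check A v = lam * v component by component."""
    Av = mat_vec(A, v)
    return all(Av[i] == lam * v[i] for i in range(2))

A = [[2, 3], [2, 1]]   # matrix from equation (4)
lambda_1 = -1
v1 = [-1, 1]           # eigenvector from equation (10)

# Every non-zero multiple of v1 is an equally valid eigenvector.
multiples = [[k * x for x in v1] for k in (1, 2, -3, 0.5)]
checks = [satisfies_eigen_equation(A, v, lambda_1) for v in multiples]
print(checks)

# A vector that is not parallel to v1 fails the same check.
print(satisfies_eigen_equation(A, [1, 1], lambda_1))
```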

Calculating the second eigenvector

Calculations for the second eigenvector are similar to those needed for the first eigenvector. We now substitute eigenvalue \lambda_2=4 into equation (1), yielding:

(11)   \begin{equation*} \begin{bmatrix}2&3\\2&1\end{bmatrix} \begin{bmatrix}x_{21}\\x_{22}\end{bmatrix} = 4 * \begin{bmatrix}x_{21}\\x_{22}\end{bmatrix}. \end{equation*}

Written as a system of equations, this is equivalent to:

(12)   \begin{eqnarray*} \left\{ \begin{array}{lr} 2x_{21} + 3x_{22} = 4x_{21}\\ 2x_{21} + x_{22} = 4x_{22} \end{array} \right. \end{eqnarray*}

Solving the first equation as a function of x_{21} results in:

(13)   \begin{equation*} x_{22} = \frac{2}{3}x_{21} \end{equation*}

We then arbitrarily choose x_{21}=3, and find x_{22}=2. Therefore, the eigenvector that corresponds to eigenvalue \lambda_2 = 4 is

(14)   \begin{equation*} \vec{v}_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}. \end{equation*}
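The procedure of the last two sections can be condensed into a few lines for the 2 \times 2 case. The sketch below is one possible shortcut rather than the method used above: for a known eigenvalue \lambda of a matrix with rows (a, b) and (c, d), the vector (b, \lambda - a) always satisfies (A - \lambda I)\vec{v} = 0. The function names are hypothetical.

```python
def mat_vec(A, v):
    """Multiply a 2x2 matrix (given as a list of rows) by a 2D vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def eigenvector_2x2(A, lam):
    """Return one eigenvector of the 2x2 matrix A for a known eigenvalue lam.

    Any non-zero vector v with (A - lam*I) v = 0 qualifies. The candidate
    (b, lam - a) cancels the first row by construction, and the second row
    vanishes because lam is a root of the characteristic polynomial.
    """
    (a, b), (c, d) = A
    if b != 0 or lam - a != 0:
        return [b, lam - a]
    if lam - d != 0 or c != 0:
        return [lam - d, c]
    # A equals lam * I, so every non-zero vector is an eigenvector.
    return [1, 0]

A = [[2, 3], [2, 1]]   # matrix from equation (4)

v1 = eigenvector_2x2(A, -1)
v2 = eigenvector_2x2(A, 4)
print("lambda = -1:", v1, "A v =", mat_vec(A, v1))
print("lambda =  4:", v2, "A v =", mat_vec(A, v2))
```

The vector returned for \lambda_1 = -1 is a scalar multiple of the eigenvector in equation (10), and the one returned for \lambda_2 = 4 matches equation (14).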

Conclusion

In this article we reviewed the theoretical concepts of eigenvectors and eigenvalues. These concepts are of great importance in many techniques used in computer vision and machine learning, such as dimensionality reduction by means of PCA, or face recognition by means of EigenFaces.


The post What are eigenvectors and eigenvalues? appeared first on Computer vision for dummies.

]]>
https://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/feed/ 20