Date post: | 07-Jul-2018 |
Category: |
Documents |
Upload: | dango-daikazoku |
8/18/2019 vision dummy.pdf
A geometric interpretation of the covariance matrix
Contents
1 Introduction
2 Eigendecomposition of a covariance matrix
3 Covariance matrix as a linear transformation
4 Conclusion
Introduction
In this article, we provide an intuitive, geometric interpretation of the covariance matrix, by
exploring the relation between linear transformations and the resulting data covariance. Most
textbooks explain the shape of data based on the concept of covariance matrices. Instead, we
take a backwards approach and explain the concept of covariance matrices based on the shape of
data.
In a previous article, we discussed the concept of variance, and provided a derivation and proof of
the well-known formula to estimate the sample variance. Figure 1 was used in that article to show
that the standard deviation, as the square root of the variance, provides a measure of how much
the data is spread across the feature space.
Figure 2. The diagonal spread of the data is captured by the covariance.
For this data, we could calculate the variance $\sigma(x,x)$ in the x-direction and the
variance $\sigma(y,y)$ in the y-direction. However, the horizontal spread and the vertical spread of the
data do not explain the clear diagonal correlation. Figure 2 clearly shows that on average, if the
x-value of a data point increases, then the y-value also increases, resulting in a positive correlation.
This correlation can be captured by extending the notion of variance to what is called the
‘covariance’ of the data:
$$\sigma(x,y) = E[(x - E(x))(y - E(y))] \qquad (2)$$
For 2D data, we thus obtain $\sigma(x,x)$, $\sigma(y,y)$, $\sigma(x,y)$ and $\sigma(y,x)$. These four values can be
summarized in a matrix, called the covariance matrix:
$$\Sigma = \begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{bmatrix} \qquad (3)$$
If x is positively correlated with y, y is also positively correlated with x. In other words, we can state
that $\sigma(x,y) = \sigma(y,x)$. Therefore, the covariance matrix is always a symmetric matrix with the
variances on its diagonal and the covariances off-diagonal. Two-dimensional normally distributed
data is explained completely by its mean and its $2 \times 2$ covariance matrix. Similarly,
a $3 \times 3$ covariance matrix is used to capture the spread of three-dimensional data, and
an $N \times N$ covariance matrix captures the spread of N-dimensional data.
Figure 3 illustrates how the overall shape of the data defines the covariance matrix:
Figure 3. The covariance matrix defines the shape of the data. Diagonal spread is captured by the
covariance, while axis-aligned spread is captured by the variance.
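As a minimal sketch of how the four values of the covariance matrix could be computed from samples, the Python snippet below uses only the standard library. The function name `covariance_matrix` and the data points are hypothetical, and the normalisation by N - 1 is an assumption (the usual sample estimate), not something prescribed by the text above.

```python
def covariance_matrix(xs, ys):
    """Return the 2x2 sample covariance matrix [[sxx, sxy], [syx, syy]]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    def cov(a, mean_a, b, mean_b):
        # Average product of deviations, normalised by N - 1 (assumed here).
        return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

    sxx = cov(xs, mean_x, xs, mean_x)   # variance in the x-direction
    syy = cov(ys, mean_y, ys, mean_y)   # variance in the y-direction
    sxy = cov(xs, mean_x, ys, mean_y)   # covariance; equals syx by symmetry
    return [[sxx, sxy], [sxy, syy]]


# Hypothetical data with a clear diagonal trend
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
sigma = covariance_matrix(xs, ys)
print(sigma)
```

The off-diagonal entries come out positive for this data, which matches the positive correlation described for figure 2.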
Eigendecomposition of a covariance matrix
In the next section, we will discuss how the covariance matrix can be interpreted as a linear
operator that transforms white data into the data we observed. However, before diving into the
technical details, it is important to gain an intuitive understanding of how eigenvectors and
eigenvalues uniquely define the covariance matrix, and therefore the shape of our data.
As we saw in figure 3, the covariance matrix defines both the spread (variance), and the
orientation (covariance) of our data. So, if we would like to represent the covariance matrix with a
vector and its magnitude, we should simply try to find the vector that points into the direction of the
largest spread of the data, and whose magnitude equals the spread (variance) in this direction.
If we define this vector as $\vec{v}$, then the projection of our data $D$ onto this vector is obtained
as $\vec{v}^{\top} D$, and the variance of the projected data is $\vec{v}^{\top} \Sigma \vec{v}$. Since we are looking for the
vector $\vec{v}$ that points into the direction of the largest variance, we should choose its components
such that the variance $\vec{v}^{\top} \Sigma \vec{v}$ of the projected data is as large as possible. Maximizing
any function of the form $\vec{v}^{\top} \Sigma \vec{v}$ with respect to $\vec{v}$, where $\vec{v}$ is a normalized unit vector, can be
formulated as a so-called Rayleigh Quotient. The maximum of such a Rayleigh Quotient is
obtained by setting $\vec{v}$ equal to the largest eigenvector of matrix $\Sigma$, i.e. the eigenvector with the largest eigenvalue.
In other words, the largest eigenvector of the covariance matrix always points into the direction of
the largest variance of the data, and the magnitude of this vector equals the corresponding
eigenvalue. The second largest eigenvector is always orthogonal to the largest eigenvector, and
points into the direction of the second largest spread of the data.
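The claim that the projected variance is largest along the largest eigenvector can be checked numerically. The sketch below is a brute-force search, not an eigen-solver: it evaluates the quadratic form v^T Sigma v for unit vectors at many angles and keeps the best one. The covariance matrix and the names `projected_variance` and `best_direction` are hypothetical, chosen for illustration.

```python
import math


def projected_variance(sigma, angle):
    """Variance v^T Sigma v of data projected onto the unit vector at `angle`."""
    vx, vy = math.cos(angle), math.sin(angle)
    return (vx * (sigma[0][0] * vx + sigma[0][1] * vy)
            + vy * (sigma[1][0] * vx + sigma[1][1] * vy))


def best_direction(sigma, steps=3600):
    """Brute-force search over [0, pi) for the direction of largest variance."""
    angles = [math.pi * k / steps for k in range(steps)]
    return max(angles, key=lambda a: projected_variance(sigma, a))


sigma = [[5.0, 4.0], [4.0, 5.0]]   # hypothetical covariance matrix
angle = best_direction(sigma)
largest_variance = projected_variance(sigma, angle)
print(math.degrees(angle), largest_variance)
```

For this matrix the eigenvalues are 9 and 1, with the largest eigenvector along the diagonal (45 degrees), so the search should land close to that angle with a projected variance close to 9.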
Now let’s have a look at some examples. In an earlier article we saw that a linear transformation
matrix is completely defined by its eigenvectors and eigenvalues. Applied to the covariance matrix, this means that:
$$\Sigma \vec{v} = \lambda \vec{v} \qquad (4)$$
where $\vec{v}$ is an eigenvector of $\Sigma$, and $\lambda$ is the corresponding eigenvalue.
If the covariance matrix of our data is a diagonal matrix, such that the covariances are zero, then
this means that the variances must be equal to the eigenvalues $\lambda$. This is illustrated by figure 4,
where the eigenvectors are shown in green and magenta, and where the eigenvalues clearly
equal the variance components of the covariance matrix.
Figure 4. Eigenvectors of a covariance matrix
However, if the covariance matrix is not diagonal, such that the covariances are not zero, then the
situation is a little more complicated. The eigenvalues still represent the variance magnitude in the
direction of the largest spread of the data, and the variance components of the covariance matrix
still represent the variance magnitude in the direction of the x-axis and y-axis. But since the data is
not axis aligned, these values are not the same anymore as shown by figure 5.
Figure 5. Eigenvalues versus variance
By comparing figure 5 with figure 4, it becomes clear that the eigenvalues represent the variance
of the data along the eigenvector directions, whereas the variance components of the covariance
matrix represent the spread along the axes. If there are no covariances, then both values are
equal.
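To make the comparison between eigenvalues and variance components concrete, here is a small sketch that uses the closed-form eigendecomposition of a symmetric 2x2 matrix. The helper name `eig_sym_2x2` and both example matrices are assumptions made for illustration; a numerical library would normally be used for larger matrices.

```python
import math


def eig_sym_2x2(sigma):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Angle of the eigenvector that belongs to the largest eigenvalue
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))   # orthogonal to v1
    return (lam1, lam2), (v1, v2)


# Diagonal covariance matrix: eigenvalues equal the variances (4 and 1).
axis_aligned = eig_sym_2x2([[4.0, 0.0], [0.0, 1.0]])

# Non-zero covariance: the variances are 5 and 5, but the eigenvalues are 9 and 1.
correlated = eig_sym_2x2([[5.0, 4.0], [4.0, 5.0]])

print(axis_aligned[0], correlated[0])
```

The second example shows the situation of figure 5: the spread along the axes (5 and 5) differs from the spread along the eigenvector directions (9 and 1).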
Covariance matrix as a linear transformation
Now let’s forget about covariance matrices for a moment. Each of the examples in figure 3 can
simply be considered to be a linearly transformed instance of figure 6:
Figure 6. Data with unit covariance matrix is called white data.
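Let the data shown by figure 6 be $D$, then each of the examples shown by figure 3 can be
obtained by linearly transforming $D$:
$$D' = T D \qquad (5)$$
where $T$ is a transformation matrix consisting of a rotation matrix $R$ and a scaling matrix $S$:
$$T = R S \qquad (6)$$
These matrices are defined as:
$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \qquad (7)$$
where $\theta$ is the rotation angle, and: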
$$S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \qquad (8)$$
where $s_x$ and $s_y$ are the scaling factors in the x direction and the y direction respectively.
In the following paragraphs, we will discuss the relation between the covariance matrix $\Sigma$, and the
linear transformation matrix $T = R S$.
Let’s start with unscaled (scale equals 1) and unrotated data. In statistics this is often referred to as
‘white data’ because its samples are drawn from a standard normal distribution and therefore
correspond to white (uncorrelated) noise:
Figure 7. White data is data with a unit covariance matrix.
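The covariance matrix of this ‘white’ data equals the identity matrix, such that the variances and
standard deviations equal 1 and the covariance equals zero:
$$\Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad (9)$$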
Now let’s scale the data in the x-direction with a factor 4:
$$D' = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} D \qquad (10)$$
The data now looks as follows:
Figure 8. Variance in the x-direction results in a horizontal scaling.
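The covariance matrix $\Sigma'$ of $D'$ is now:
$$\Sigma' = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 16 & 0 \\ 0 & 1 \end{bmatrix} \qquad (11)$$
Thus, the covariance matrix $\Sigma'$ of the resulting data $D'$ is related to the linear
transformation $T$ that is applied to the original data as follows: $D' = T D$, where
$$T = \sqrt{\Sigma'} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} \qquad (12)$$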
However, although equation (12) holds when the data is scaled in the x and y direction, the
question arises whether it also holds when a rotation is applied. To investigate the relation between the
linear transformation matrix $T$ and the covariance matrix $\Sigma'$ in the general case, we will therefore
try to decompose the covariance matrix into the product of rotation and scaling matrices.
As we saw earlier, we can represent the covariance matrix by its eigenvectors and eigenvalues:
$$\Sigma \vec{v} = \lambda \vec{v} \qquad (13)$$
where $\vec{v}$ is an eigenvector of $\Sigma$, and $\lambda$ is the corresponding eigenvalue.
Equation (13) holds for each eigenvector-eigenvalue pair of matrix $\Sigma$. In the 2D case, we obtain
two eigenvectors and two eigenvalues. The system of two equations defined by equation (13) can
be represented efficiently using matrix notation:
$$\Sigma V = V L \qquad (14)$$
where $V$ is the matrix whose columns are the eigenvectors of $\Sigma$ and $L$ is the diagonal matrix
whose non-zero elements are the corresponding eigenvalues.
This means that we can represent the covariance matrix as a function of its eigenvectors and
eigenvalues:
$$\Sigma = V L V^{-1} \qquad (15)$$
Equation (15) is called the eigendecomposition of the covariance matrix and can be obtained
using a Singular Value Decomposition algorithm (for a symmetric, positive semi-definite matrix
such as $\Sigma$, the two decompositions coincide). Whereas the eigenvectors represent the
directions of the largest variance of the data, the eigenvalues represent the magnitude of this
variance in those directions. In other words, $V$ represents a rotation matrix, while $\sqrt{L}$ represents
a scaling matrix. The covariance matrix can thus be decomposed further as:
$$\Sigma = R S S R^{-1} \qquad (16)$$
where $R = V$ is a rotation matrix and $S = \sqrt{L}$ is a scaling matrix.
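In equation (6) we defined a linear transformation $T = R S$. Since $S$ is a diagonal scaling
matrix, $S = S^{\top}$. Furthermore, since $R$ is an orthogonal matrix, $R^{-1} = R^{\top}$.
Therefore, $T^{\top} = (R S)^{\top} = S^{\top} R^{\top} = S R^{-1}$. The covariance matrix can thus be written as:
$$\Sigma = R S S R^{-1} = T T^{\top} \qquad (17)$$
In other words, if we apply the linear transformation defined by $T = R S$ to the original white
data $D$ shown by figure 7, we obtain the rotated and scaled data $D'$ with covariance
matrix $T T^{\top} = \Sigma' = R S S R^{-1}$. This is illustrated by figure 10: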
Figure 10. The covariance matrix represents a linear transformation of the original data.
The colored arrows in figure 10 represent the eigenvectors. The largest eigenvector, i.e. the
eigenvector with the largest corresponding eigenvalue, always points in the direction of the largest
variance of the data and thereby defines its orientation. Subsequent eigenvectors are always
orthogonal to the largest eigenvector due to the orthogonality of rotation matrices.
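The relation between the transformation T = RS and the covariance matrix T T^T can be checked with a small simulation. The sketch below is only illustrative: the rotation angle (30 degrees), the scaling factors (4 and 1), the sample count, the seed and all function names are assumptions. It draws white data, applies T, and compares the sample covariance with the prediction.

```python
import math
import random


def transform_matrix(theta, sx, sy):
    """T = R S for a rotation by theta and scaling factors sx, sy."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * sx, -s * sy],
            [s * sx,  c * sy]]


def t_times_t_transpose(t):
    """Predicted covariance matrix Sigma = T T^T."""
    return [[sum(t[i][k] * t[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sample_covariance(points):
    """2x2 sample covariance matrix of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


rng = random.Random(0)  # fixed seed so the run is repeatable
white = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50000)]

T = transform_matrix(math.radians(30), 4.0, 1.0)
transformed = [(T[0][0] * x + T[0][1] * y, T[1][0] * x + T[1][1] * y)
               for x, y in white]

predicted = t_times_t_transpose(T)
observed = sample_covariance(transformed)
print(predicted)
print(observed)
```

The two printed matrices should agree up to sampling noise, which shrinks as the number of samples grows.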
Conclusion
In this article we showed that the covariance matrix of observed data is directly related to a linear
transformation of white, uncorrelated data. This linear transformation is completely defined by the
eigenvectors and eigenvalues of the data. While the eigenvectors represent the rotation matrix,
the eigenvalues correspond to the square of the scaling factor in each dimension.
How to draw a covariance error ellipse?
Contents
1 Introduction
2 Axis-aligned confidence ellipses
3 Arbitrary confidence ellipses
4 Source Code
5 Conclusion
Introduction
In this post, I will show how to draw an error ellipse, a.k.a. confidence ellipse, for 2D normally distributed data. The error ellipse represents an iso-contour of the Gaussian distribution, and allows you to visualize a 2D confidence interval. The following figure shows a 95% confidence ellipse for a set of 2D normally distributed data samples. This confidence ellipse defines the region that contains 95% of all samples that can be drawn from the underlying Gaussian distribution.
Figure 1. 2D confidence ellipse for normally distributed data
In the next sections we will discuss how to obtain confidence ellipses for different confidence values (e.g. 99% confidence interval), and we will show how to plot these ellipses using Matlab or C++ code.
Axis-aligned confidence ellipses
Before deriving a general methodology to obtain an error ellipse, let’s have a look at the special case where the major axis of the ellipse is aligned with the X-axis, as shown by the following figure:
Figure 2. Confidence ellipse for uncorrelated Gaussian data
The above figure illustrates that the angle of the ellipse is determined by the covariance of the data. In this case, the covariance is zero, such that the data is uncorrelated, resulting in an axis-aligned error ellipse.
Table 1. Covariance matrix of the data shown in Figure 2
8.4213 0
0 0.9387
Furthermore, it is clear that the magnitudes of the ellipse axes depend on the variance of the data. In our case, the largest variance is in the direction of the X-axis, whereas the smallest variance lies in the direction of the Y-axis.
In general, the equation of an axis-aligned ellipse with a major axis of length $2a$ and a minor axis of length $2b$, centered at the origin, is defined by the following equation:
$$\left(\frac{x}{a}\right)^2 + \left(\frac{y}{b}\right)^2 = 1 \qquad (1)$$
In our case, the lengths of the axes are defined by the standard deviations $\sigma_x$ and $\sigma_y$ of the data such that the equation of the error ellipse becomes:
$$\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2 = s \qquad (2)$$
where $s$ defines the scale of the ellipse and could be any arbitrary number (e.g. s=1). The question is now how to choose $s$, such that the scale of the resulting ellipse represents a chosen confidence level (e.g. a 95% confidence level corresponds to s=5.991).
Our 2D data is sampled from a multivariate Gaussian with zero covariance. This means that both the x-values and the y-values are normally distributed too. Therefore, the left hand side of equation (2) actually represents the sum of squares of independent normally
distributed data samples. The sum of squared Gaussian data points is known to be distributed according to a so-called Chi-Square distribution. A Chi-Square distribution is defined in terms of ‘degrees of freedom’, which represent the number of unknowns. In our case there are two unknowns, and therefore two degrees of freedom.
Therefore, we can easily obtain the probability that the above sum, and thus $s$, equals a specific value by calculating the Chi-Square likelihood. In fact, since we are interested in a confidence interval, we are looking for the probability that $s$ is less than or equal to a specific value, which can easily be obtained using the cumulative Chi-Square distribution.
As statisticians are lazy people, we usually don’t try to calculate this probability, but simply
look it up in a probability table: https://people.richland.edu/james/lecture/m170/tbl-chi.html.
For example, using this probability table we can easily find that, in the 2-degrees of freedom case:
$$P(s < 5.991) = 1 - 0.05 = 0.95$$
Therefore, a 95% confidence interval corresponds to s=5.991. In other words, 95% of the data will fall inside the ellipse defined as:
$$\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2 = 5.991 \qquad (3)$$
Similarly, a 99% confidence interval corresponds to s=9.210 and a 90% confidence interval corresponds to s=4.605.
The error ellipse shown by figure 2 can therefore be drawn as an ellipse with a major axis
length equal to $2\sigma_x\sqrt{5.991}$ and a minor axis length equal to $2\sigma_y\sqrt{5.991}$.
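Instead of a lookup table, the scale s can also be computed directly in the 2-degrees of freedom case, because the cumulative Chi-Square distribution then has the closed form 1 - exp(-s/2). The Python sketch below relies on that identity; the function names are hypothetical, and the variances are the ones listed in Table 1.

```python
import math


def chi2_scale_2dof(confidence):
    """Scale s such that P(chi-square with 2 degrees of freedom <= s) = confidence."""
    return -2.0 * math.log(1.0 - confidence)


def axis_lengths(var_x, var_y, confidence=0.95):
    """Full axis lengths of an axis-aligned error ellipse."""
    s = chi2_scale_2dof(confidence)
    return 2.0 * math.sqrt(s * var_x), 2.0 * math.sqrt(s * var_y)


for level in (0.90, 0.95, 0.99):
    print(level, round(chi2_scale_2dof(level), 3))

# Variances taken from Table 1 (covariance matrix of the data in Figure 2)
major, minor = axis_lengths(8.4213, 0.9387)
print(major, minor)
```

This shortcut only applies to two degrees of freedom; for other dimensions a table or a statistics library would be needed.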
Arbitrary confidence ellipses
In cases where the data is correlated, such that a covariance exists, the resulting error ellipse will not be axis aligned. In this case, the reasoning of the above paragraph only holds if we temporarily define a new coordinate system such that the ellipse becomes axis-aligned, and then rotate the resulting ellipse afterwards.
In other words, whereas we calculated the variances $\sigma_x^2$ and $\sigma_y^2$ parallel to the x-axis and y-axis earlier, we now need to calculate these variances parallel to what will become the major and minor axis of the confidence ellipse. The directions in which these variances need to be calculated are illustrated by a pink and a green arrow in figure 1.
Figure 1. 2D confidence ellipse for normally distributed data
These directions are actually the directions in which the data varies the most, and are defined by the covariance matrix. The covariance matrix can be considered as a matrix that linearly transformed some original data to obtain the currently observed data. In a previous article about eigenvectors and eigenvalues we showed that the direction vectors along such a linear transformation are the eigenvectors of the transformation matrix. Indeed, the vectors shown by pink and green arrows in figure 1, are the eigenvectors of the covariance matrix of the data, whereas the length of the vectors corresponds to the eigenvalues.
The eigenvalues therefore represent the spread of the data in the direction of the eigenvectors. In other words, the eigenvalues represent the variance of the data in the direction of the eigenvectors. In the case of axis aligned error ellipses, i.e. when the covariance equals zero, the eigenvalues equal the variances of the covariance matrix and the eigenvectors are equal to the definition of the x-axis and y-axis. In the case of arbitrary correlated data, the eigenvectors represent the direction of the largest spread of the data, whereas the eigenvalues define how large this spread really is.
Thus, the 95% confidence ellipse can be defined similarly to the axis-aligned case, with the major axis of length $2\sqrt{5.991\lambda_1}$ and the minor axis of length $2\sqrt{5.991\lambda_2}$, where $\lambda_1$ and $\lambda_2$ represent the eigenvalues of the covariance matrix.
To obtain the orientation of the ellipse, we simply calculate the angle of the largest eigenvector towards the x-axis:
$$\alpha = \arctan\frac{\mathbf{v}_1(y)}{\mathbf{v}_1(x)} \qquad (4)$$
where $\mathbf{v}_1$ is the eigenvector of the covariance matrix that corresponds to the largest eigenvalue.
Based on the minor and major axis lengths and the angle $\alpha$ between the major axis and the x-axis, it becomes trivial to plot the confidence ellipse. Figure 3 shows error ellipses for several confidence values:
Confidence ellipses for normally distributed data
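The steps for an arbitrary confidence ellipse can be sketched in a few lines of Python. This is not the Matlab or C++ code linked from the article; it is an independent illustration under stated assumptions: the function names, the example covariance matrix and the zero mean are hypothetical, the eigenvalues come from the closed form for a symmetric 2x2 matrix, and `atan2` is used instead of a plain arctangent so that a vertical major axis does not cause a division by zero.

```python
import math


def error_ellipse(sigma, s=5.991):
    """Return (major_length, minor_length, alpha) for a 2x2 covariance matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1 = half_trace + radius
    lam2 = max(half_trace - radius, 0.0)   # guard against rounding below zero
    # Eigenvector of the largest eigenvalue; handle the diagonal case separately
    if b != 0.0:
        v1 = (b, lam1 - a)
    else:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    alpha = math.atan2(v1[1], v1[0])
    return 2.0 * math.sqrt(s * lam1), 2.0 * math.sqrt(s * lam2), alpha


def ellipse_points(center, major, minor, alpha, n=100):
    """Points on the rotated ellipse, ready to be passed to any plotting tool."""
    points = []
    for k in range(n):
        t = 2.0 * math.pi * k / n
        ex, ey = 0.5 * major * math.cos(t), 0.5 * minor * math.sin(t)
        points.append((center[0] + ex * math.cos(alpha) - ey * math.sin(alpha),
                       center[1] + ex * math.sin(alpha) + ey * math.cos(alpha)))
    return points


sigma = [[5.0, 4.0], [4.0, 5.0]]          # hypothetical covariance matrix
major, minor, alpha = error_ellipse(sigma)
outline = ellipse_points((0.0, 0.0), major, minor, alpha)
print(major, minor, math.degrees(alpha))
```

Changing `s` to 4.605 or 9.210 gives the 90% and 99% ellipses mentioned earlier.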
Source Code
Matlab source code
C++ source code (uses OpenCV)
Conclusion
In this article we showed how to obtain the error ellipse for 2D normally distributed data, according to a chosen confidence value. This is often useful when visualizing or analyzing data and will be of interest in a future article about PCA.
Why divide the sample variance by N-1?
Contents
1 Introduction
2 Minimum variance, unbiased estimators
2.1 Parameter bias
2.2 Parameter variance
3 Maximum Likelihood estimation
4 Estimating the variance if the mean is known
4.1 Parameter estimation
4.2 Performance evaluation
5 Estimating the variance if the mean is unknown
5.1 Parameter estimation
5.2 Performance evaluation
5.3 Fixing the bias
6 Conclusion
Introduction
In this article, we will derive the well-known formulas for calculating the mean and the variance of normally distributed data, in order to answer the question in the article’s title. However, for readers who are not interested in the ‘why’ of this question but only in the ‘when’, the answer is quite simple:
If you have to estimate both the mean and the variance of the data (which is
typically the case), then divide by N-1, such that the variance is obtained as:
$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2$$
If, on the other hand, the mean of the true population is known such that only the variance needs to be estimated, then divide by N, such that the variance is
obtained as:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Whereas the former is what you will typically need, an example of the latter would be the estimation of the spread of white Gaussian noise. Since the mean
of white Gaussian noise is known to be zero, only the variance needs to be estimated in this case.
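Both cases map onto functions in Python's standard `statistics` module, as the short sketch below illustrates. The sample values are made up for the example; `statistics.variance` divides by N-1 around the estimated mean, while `statistics.pvariance` with an explicit `mu` divides by N around the known mean.

```python
import statistics

# Hypothetical measurements whose mean is unknown
samples = [9.1, 12.4, 10.3, 7.8, 11.6, 8.9]

# Mean unknown: estimate it from the data and divide by N - 1.
var_unknown_mean = statistics.variance(samples)

# Hypothetical white Gaussian noise samples, whose mean is known to be zero
noise = [0.4, -1.1, 0.7, -0.2, 1.3, -0.9]

# Mean known in advance: divide by N around that known mean.
var_known_mean = statistics.pvariance(noise, mu=0.0)

print(var_unknown_mean, var_known_mean)
```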
If we now calculate the empirical mean by summing up all values and dividing by the number of observations, we have:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$
Usually we assume that the empirical mean is close to the actually unknown mean of the distribution, and thus assume that the observed data is sampled from a Gaussian distribution with mean $\mu$. In this example, the actual mean of the distribution is 10, so the empirical mean indeed is close to the actual mean.
The variance of the data is calculated as follows:
$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad (2)$$
Again, we usually assume that this empirical variance is close to the real and unknown variance of the underlying distribution. In this example, the real variance was 9, so indeed the empirical variance is close to the real variance.
The question at hand is now why the formulas used to calculate the empirical mean and the empirical variance are correct. In fact, another often used formula to calculate the variance, is defined as follows:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad (3)$$
The only difference between equation (2) and (3) is that the former divides by N-1, whereas the latter divides by N. Both formulas are actually correct, but when to use which one depends on the situation.
In the following sections, we will completely derive the formulas that best approximate the unknown variance and mean of a normal distribution, given a few samples from this distribution. We will show in which cases to divide the variance by N and in which cases to normalize by N-1.
A formula that approximates a parameter (mean or variance) is called an estimator. In the following, we will denote the real and unknown parameters of the distribution by $\mu$ and $\sigma^2$. The estimators, e.g. the empirical average and empirical variance, are denoted as $\hat{\mu}$ and $\hat{\sigma}^2$.
To find the optimal estimators, we first need an analytical expression for the likelihood of observing a specific data point $x_i$, given the fact that the population is normally distributed with a given mean $\mu$ and standard deviation $\sigma$. A normal distribution with known
parameters is usually denoted as $N(\mu, \sigma^2)$. The likelihood function is then:
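$$x_i \sim N(\mu, \sigma^2) \Rightarrow P(x_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2} \qquad (4)$$
To calculate the mean and variance, we obviously need more than one sample from this distribution. In the following, let vector $\vec{x} = (x_1, x_2, \ldots, x_N)$ be a vector that contains all the available samples (e.g. all the values from the example in table 1). If all these samples are statistically independent, we can write their joint likelihood function as the product of all individual likelihoods:
$$P(\vec{x}; \mu, \sigma) = P(x_1, x_2, \ldots, x_N; \mu, \sigma) = \prod_{i=1}^{N} P(x_i; \mu, \sigma) \qquad (5)$$
Plugging equation (4) into equation (5) then yields an analytical expression for this joint probability density function:
$$P(\vec{x}; \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{N/2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2} \qquad (6)$$
Equation (6) will be important in the next sections and will be used to derive the well-known expressions for the estimators of the mean and the variance of a Gaussian distribution.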
Minimum variance, unbiased estimators
To determine if an estimator is a ‘good’ estimator, we first need to define what a ‘good’ estimator really is. The goodness of an estimator depends on two measures, namely its bias and its variance (yes, we will talk about the variance of the mean-estimator and the variance of the variance-estimator). Both measures are briefly discussed in this section.
Parameter bias
Imagine that we could obtain different (disjoint) subsets of the complete population. In analogy to our previous example, imagine that, apart from the data in Table 1, we also have a Table 2 and a Table 3 with different observations. Then a good estimator for the mean would be an estimator that on average would be equal to the real mean. Although we can live with the idea that the empirical mean from one subset of data is not equal to the real mean like in our example, a good estimator should make sure that the average of
the estimated means from all subsets is equal to the real mean. This constraint is expressed mathematically by stating that the expected value of the estimator should equal the real parameter value:
$E[\hat{\mu}] = \mu$ and $E[\hat{\sigma}^2] = \sigma^2$ (7)
If the above conditions hold, then the estimators are called ‘unbiased estimators’. If the conditions do not hold, the estimators are said to be ‘biased’, since on average they will either underestimate or overestimate the true value of the parameter.
Parameter variance
Unbiased estimators guarantee that on average they yield an estimate that equals the real parameter. However, this does not mean that each estimate is a good estimate. For instance, if the real mean is 10, an unbiased estimator could estimate the mean as 50 on one population subset and as -30 on another subset. The expected value of the estimate would then indeed be 10, which equals the real parameter, but the quality of the estimator clearly also depends on the spread of each estimate. An estimator that yields the estimates (10, 15, 5, 12, 8) for five different subsets of the population is unbiased just like an estimator that yields the estimates (50, -30, 100, -90, 10). However, all estimates from the first estimator are closer to the true value than those from the second estimator.
Therefore, a good estimator not only has a low bias, but also yields a low variance. This variance is expressed as the mean squared error of the estimator:
$\mathrm{Var}(\hat{\mu}) = E[(\hat{\mu} - \mu)^2]$ and $\mathrm{Var}(\hat{\sigma}^2) = E[(\hat{\sigma}^2 - \sigma^2)^2]$
A good estimator is therefore a low-bias, low-variance estimator. The optimal estimator, if such an estimator exists, is then the one that has no bias and a variance that is lower than that of any other possible estimator. Such an estimator is called the minimum variance, unbiased (MVU) estimator. In the next section, we will derive the analytical expressions for the mean and the variance estimators of a Gaussian distribution. We will show that the MVU estimator for the variance of a normal distribution requires us to divide by N under certain assumptions, and requires us to divide by N-1 if these assumptions do not hold.
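As a rough illustration of both measures, the following sketch simulates many subsets drawn from a hypothetical normal population and measures the bias and the variance of the mean estimator empirically. The population parameters, subset size, number of subsets, and random seed are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters (assumptions for this sketch).
true_mean, true_std = 10.0, 3.0
n_subsets, subset_size = 20000, 5

# Each row is one subset of the population; estimate the mean on each.
samples = rng.normal(true_mean, true_std, size=(n_subsets, subset_size))
estimates = samples.mean(axis=1)

# Bias: how far the average estimate is from the true parameter.
bias = estimates.mean() - true_mean
# Variance of the estimator, measured as its mean squared error.
variance = np.mean((estimates - true_mean) ** 2)

print(f"bias ~ {bias:.4f}, variance ~ {variance:.4f}")
```

Under these assumptions the bias should be close to zero, while the variance should be close to the population variance divided by the subset size.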
Maximum Likelihood estimation
Although numerous techniques can be used to obtain an estimator of the parameters based on a subset of the population data, the simplest of all is probably the maximum likelihood approach.
The probability of observing $\vec{x}$ was defined by equation (6) as $P(\vec{x}; \mu, \sigma^2)$. If we fix $\mu$ and $\sigma^2$ in this function, while letting $\vec{x}$ vary, we obtain the Gaussian distribution as plotted by figure 1. However, we could also choose a fixed $\vec{x}$ and let $\mu$ and/or $\sigma^2$ vary. For example, we can choose $\vec{x}$ to be the sample values from our previous example. We also choose a fixed $\mu$, and we let $\sigma^2$ vary. Figure 2 shows the plot of each different value of $\sigma^2$ for the distribution with the proposed fixed $\vec{x}$ and $\mu$:
Figure 2. This plot shows the likelihood of observing fixed data $\vec{x}$ if the data is normally distributed with a chosen, fixed $\mu$, plotted against various values of a varying $\sigma^2$.
In the above figure, we calculated the likelihood $P(\vec{x}; \sigma^2)$ by varying $\sigma^2$ for a fixed $\mu$. Each point in the resulting curve represents the likelihood that observation $\vec{x}$ is a sample from a Gaussian distribution with parameter $\sigma^2$. The parameter value that corresponds to the highest likelihood is then most likely the parameter that defines the distribution our data originated from. Therefore, we can determine the optimal $\sigma^2$ by finding the maximum in this likelihood curve. In this example, the maximum is at $\sigma^2 = 7.8$, such that the standard deviation is $\sqrt{7.8} \approx 2.8$. Indeed, if we would calculate the variance in the traditional way, with a given $\mu$, we would find that it is equal to 7.8:
Therefore, the formula to compute the variance based on the sample data is simply derived by finding the peak of the likelihood function. Furthermore, instead of
fixing $\mu$, we can let both $\mu$ and $\sigma^2$ vary at the same time. Finding both estimators then corresponds to finding the maximum in a two-dimensional likelihood function.
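The one-dimensional search described above can be sketched numerically. The sample values and the fixed mean below are assumptions picked for illustration (they are not taken from the text's table) and were chosen so that the peak lands at the value 7.8 quoted above. The helper `log_likelihood` is a hypothetical name; it evaluates the logarithm of the joint Gaussian likelihood, which peaks at the same variance as the likelihood itself.

```python
import numpy as np

def log_likelihood(x, mu, var):
    """Log of the joint Gaussian likelihood of independent samples x."""
    n = x.size
    return -0.5 * n * np.log(2.0 * np.pi * var) - np.sum((x - mu) ** 2) / (2.0 * var)

# Illustrative sample and fixed mean (assumptions made for this sketch).
x = np.array([10.0, 12.0, 7.0, 5.0, 11.0])
mu = 10.0

# Evaluate the log-likelihood on a grid of candidate variances (step 0.01).
candidates = np.linspace(1.0, 20.0, 1901)
scores = np.array([log_likelihood(x, mu, v) for v in candidates])
best_var = candidates[np.argmax(scores)]

# Traditional calculation: mean squared deviation from the known mean.
closed_form = np.mean((x - mu) ** 2)

print(best_var, closed_form)
```

The grid search and the closed-form calculation should agree up to the grid resolution, which is the point the text makes: the familiar variance formula is the location of the likelihood peak.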
To find the maximum of a function, we simply set its derivative to zero. If we want to find the maximum of a function with two variables, we calculate the partial derivative with respect to each of these variables and set both to zero. In the following, let $\hat{\mu}$ be the optimal estimator for the population mean as obtained using the maximum likelihood method, and let $\hat{\sigma}^2$ be the optimal estimator for the variance. To maximize the likelihood function we simply calculate its (partial) derivatives and set them to zero as follows:
$\frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0$
and
$\frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0$
In the following paragraphs we will use this technique to obtain the MVU estimators of both $\mu$ and $\sigma^2$. We consider two cases:
The first case assumes that the true mean of the distribution is known. Therefore, we only need to estimate the variance, and the problem then corresponds to finding the maximum in a one-dimensional likelihood function, parameterized by $\sigma^2$. Although this situation does not occur often in practice, it definitely has practical applications. For instance, if we know that a signal (e.g. the color value of a pixel in an image) should have a specific value, but the signal has been polluted by white noise (Gaussian noise with zero mean), then the mean of the distribution is known and we only need to estimate the variance.
The second case deals with the situation where both the true mean and the true variance are unknown. This is the case you would encounter most often, and where you would obtain an estimate of the mean and the variance based on your sample data.
In the next paragraphs we will show that each case results in a different MVU estimator. More specifically, the first case requires the variance estimator to be normalized by $N$ to be MVU, whereas the second case requires division by $N-1$ to be MVU.
Estimating the variance if the mean is known
Parameter estimation
If the true mean of the distribution is known, then the likelihood function is only parameterized on $\sigma^2$. Obtaining the maximum likelihood estimator then corresponds to solving:
$\hat{\sigma}^2 = \arg\max_{\sigma^2} P(\vec{x}; \sigma^2)$ (8)
However, calculating the derivative of $P(\vec{x}; \sigma^2)$, defined by equation (6), is rather involved due to the exponent in the function. In fact, it is much easier to maximize the log-likelihood function instead of the likelihood function itself. Since the logarithm is a monotonic function, the maximum will be at the same location. Therefore, we solve the following problem instead:
$\hat{\sigma}^2 = \arg\max_{\sigma^2} \log P(\vec{x}; \sigma^2)$ (9)
In the following we set $s = \sigma^2$ to obtain a simpler notation. To find the maximum of the log-likelihood function, we simply calculate the derivative of the logarithm of equation (6) and set it to zero:
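A sketch of this step, writing $s$ for the variance $\sigma^2$ and starting from the logarithm of the joint Gaussian likelihood of equation (6). The intermediate steps are a reconstruction under that notation, not a verbatim copy of the original derivation:

```latex
\begin{aligned}
\log P(\vec{x}; s) &= -\frac{N}{2}\log(2\pi s) - \frac{1}{2s}\sum_{i=1}^{N}(x_i - \mu)^2 \\
\frac{\partial \log P(\vec{x}; s)}{\partial s}
  &= -\frac{N}{2s} + \frac{1}{2s^2}\sum_{i=1}^{N}(x_i - \mu)^2 \\
  &= \frac{N}{2s^2}\left(\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 - s\right) = 0
\end{aligned}
```

The factor in front of the parentheses cannot be zero for a finite, positive $s$, so the expression inside the parentheses has to vanish.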
It is clear that if $N > 0$, then the only possible solution to the above is:
$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (10)
Note that this maximum likelihood estimator for $\sigma^2$ is indeed the traditional formula to calculate the variance of normal data. The normalization factor is $\frac{1}{N}$.
However, the maximum likelihood method does not guarantee to deliver an unbiased estimator. On the other hand, if the obtained estimator is unbiased, then the maximum likelihood method does guarantee that the estimator is also minimum variance and thus MVU. Therefore, we need to check if the estimator in equation (10) is unbiased.
Performance evaluation
To check if the estimator defined by equation (10) is unbiased, we need to check if the condition of equation (7) holds, and thus if $E[\hat{\sigma}^2] = \sigma^2$.
To do this, we plug equation (10) into $E[\hat{\sigma}^2]$ and write:
Furthermore, an important property of variance is that the true variance $\sigma^2$ can be written as $\sigma^2 = E[x^2] - E[x]^2$ such that $E[x^2] = \sigma^2 + E[x]^2 = \sigma^2 + \mu^2$. Using this property in the above equation yields:
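The two steps just described (plugging the estimator into the expectation, then applying the variance property) can be written out as follows. This is a reconstruction that assumes the known-mean estimator $\hat{\sigma}^2 = \frac{1}{N}\sum_{i}(x_i - \mu)^2$ and uses $E[x_i] = \mu$ and $E[x_i^2] = \sigma^2 + \mu^2$:

```latex
\begin{aligned}
E[\hat{\sigma}^2]
  &= E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\right]
   = \frac{1}{N}\sum_{i=1}^{N} E\left[x_i^2 - 2\mu x_i + \mu^2\right] \\
  &= \frac{1}{N}\sum_{i=1}^{N}\left(E[x_i^2] - 2\mu E[x_i] + \mu^2\right) \\
  &= \frac{1}{N}\sum_{i=1}^{N}\left(\sigma^2 + \mu^2 - 2\mu^2 + \mu^2\right)
   = \sigma^2
\end{aligned}
```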
Since $E[\hat{\sigma}^2] = \sigma^2$, the condition shown by equation (7) holds, and therefore the obtained estimator for the variance of the data is unbiased. Furthermore, because the maximum likelihood method guarantees that an unbiased estimator is also minimum variance (MVU), this means that no other estimator exists that can do better than the one obtained here. Therefore, we have to divide by $N$ instead of $N-1$ while calculating the variance of normally distributed data, if the true mean of the underlying distribution is known.
Estimating the variance if the mean is unknown
Parameter estimation
In the previous section, the true mean of the distribution was known, such that we only had to find an estimator for the variance of the data. However, if the true mean is not known, then an estimator has to be found for the mean too. Furthermore, this mean estimate is used by the variance estimator. As a result, we will show that the earlier obtained estimator for the variance is no longer unbiased. Furthermore, we will show that we can ‘unbias’ the estimator in this case by dividing by $N-1$ instead of by $N$, which slightly increases the variance of the estimator.
As before, we use the maximum likelihood method to obtain the estimators based on the log-likelihood function. We first find the ML estimator for $\mu$:
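A sketch of this step, again starting from the logarithm of the joint Gaussian likelihood of equation (6). The intermediate steps are a reconstruction, not a verbatim copy of the original derivation:

```latex
\begin{aligned}
\frac{\partial \log P(\vec{x}; \mu, \sigma^2)}{\partial \mu}
  &= \frac{\partial}{\partial \mu}\left(-\frac{N}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2\right) \\
  &= \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu)
   = \frac{N}{\sigma^2}\left(\frac{1}{N}\sum_{i=1}^{N}x_i - \mu\right) = 0
\end{aligned}
```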
If $N > 0$, then it is clear that the above equation only has a solution if:
$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i$ (11)
Note that this is indeed the well-known formula to calculate the mean of a distribution. Although we all knew this formula, we have now proved that it is the maximum likelihood estimator for the true and unknown mean of a normal distribution. For now, we will just assume that the estimator that we found earlier for the variance $\sigma^2$, defined by equation (10), is still the MVU variance estimator. In the next section, however, we will show that this estimator is no longer unbiased.
Performance evaluation
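To check if the estimator $\hat{\mu}$ for the true mean $\mu$ is unbiased, we have to make sure that the condition of equation (7) holds:
$E[\hat{\mu}] = E\left[\frac{1}{N}\sum_{i=1}^{N}x_i\right] = \frac{1}{N}\sum_{i=1}^{N}E[x_i] = \mu$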
Since $E[\hat{\mu}] = \mu$, this means that the obtained estimator for the mean of the distribution is unbiased. Since the maximum likelihood method guarantees to deliver the minimum variance estimator if the estimator is unbiased, we proved that $\hat{\mu}$ is the MVU estimator of the mean.
To check if the earlier found estimator for the variance $\hat{\sigma}^2$ is still unbiased if it is based on the empirical mean $\hat{\mu}$ instead of the true mean $\mu$, we simply plug the obtained estimator $\hat{\mu}$ into the earlier derived estimator of equation (10):
$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$
To check if the estimator is still unbiased, we now need to check again if the condition of equation (7) holds:
As mentioned in the previous section, an important property of variance is that the true variance $\sigma^2$ can be written as $\sigma^2 = E[x^2] - E[x]^2$ such that $E[x^2] = \sigma^2 + E[x]^2 = \sigma^2 + \mu^2$. Using this property in the above equation yields:
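The calculation can be sketched as follows. This is a reconstruction that assumes independent samples, so that the empirical mean $\hat{\mu} = \frac{1}{N}\sum_i x_i$ has variance $\sigma^2/N$, and it applies the variance property to both $x$ and $\hat{\mu}$:

```latex
\begin{aligned}
E[\hat{\sigma}^2]
  &= E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2\right]
   = E\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \hat{\mu}^2\right] \\
  &= E[x^2] - E[\hat{\mu}^2]
   = \left(\sigma^2 + \mu^2\right) - \left(\frac{\sigma^2}{N} + \mu^2\right) \\
  &= \frac{N-1}{N}\,\sigma^2
\end{aligned}
```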
Since clearly $E[\hat{\sigma}^2] \neq \sigma^2$, this shows that the estimator for the variance of the distribution is no longer unbiased. In fact, this estimator on average underestimates the true variance by a factor $\frac{N-1}{N}$. As the number of samples approaches infinity ($N \to \infty$), this bias converges to zero. For small sample sets, however, the bias is significant and should be eliminated.
Fixing the bias
Since the bias is merely a factor, we can eliminate it by scaling the biased estimator defined by equation (10) by the inverse of the bias. We therefore define a new, unbiased estimate as follows:
$\hat{\sigma}^2_{\text{unbiased}} = \frac{N}{N-1}\,\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$
This estimator is now unbiased and indeed resembles the traditional formula to calculate the variance, where we divide by $N-1$ instead of $N$. However, note that the resulting estimator is no longer the minimum variance estimator, but it is the estimator with the minimum variance amongst all unbiased estimators. If we divide by $N$, then the estimator is biased, and if we divide by $N-1$, the estimator is not the minimum variance estimator. However, in general, having a biased estimator is much worse than having a slightly higher variance estimator. Therefore, if the mean of the population is unknown, division by $N-1$ should be used instead of division by $N$.
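The effect of the normalization factor can be checked with a small simulation. The sketch below draws many small samples from a hypothetical normal population (the population parameters, sample size, trial count, and seed are assumptions) and averages three variance estimates: division by N around the known mean, division by N around the empirical mean, and division by N-1 around the empirical mean. It relies on the `ddof` parameter of NumPy's `np.var`, which sets the divisor to `N - ddof`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population parameters (assumptions for this sketch).
true_mean, true_var = 0.0, 4.0
n_trials, n = 200000, 5

x = rng.normal(true_mean, np.sqrt(true_var), size=(n_trials, n))

# Known mean, divide by N (the estimator of equation (10)).
known_mean_div_n = np.mean((x - true_mean) ** 2, axis=1)
# Empirical mean, divide by N (biased).
est_mean_div_n = np.var(x, axis=1, ddof=0)
# Empirical mean, divide by N-1 (unbiased).
est_mean_div_n1 = np.var(x, axis=1, ddof=1)

print(known_mean_div_n.mean(), est_mean_div_n.mean(), est_mean_div_n1.mean())
```

Under these assumptions, the first and third averages should be close to the true variance, while the second should be close to the true variance multiplied by (N-1)/N.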
Conclusion
In this article, we showed where the usual formulas for calculating the mean and the variance of normally distributed data come from. Furthermore, we have proven that the normalization factor in the variance estimator formula should be $\frac{1}{N}$ if the true mean of the population is known, and should be $\frac{1}{N-1}$ if the mean itself also has to be estimated.
light. As a result, the R, G, B components of a pixel are statistically correlated. Therefore, simply eliminating the R component from the feature vector also implicitly removes information about the G and B channels.
In other words, before eliminating features, we would like to transform the complete feature space such that the underlying uncorrelated components are obtained.
Consider the following example of a 2D feature space:
Figure 1. 2D correlated data with eigenvectors shown in color.
The features $x$ and $y$, illustrated by figure 1, are clearly correlated. In fact, their covariance matrix is:
In an earlier article we discussed the geometric interpretation of the covariance matrix. We saw that the covariance matrix can be decomposed as a sequence of rotation and scaling operations on white, uncorrelated data, where the rotation matrix is defined by the eigenvectors of this covariance matrix. Therefore, intuitively, it is easy to see that the data shown in figure 1 can be decorrelated by rotating each data point such that the eigenvectors become the new reference axes:
(1)
Figure 2. 2D Uncorrelated data with eigenvectors shown in color.
The covariance matrix of the resulting data is now diagonal, meaning that the new axes are uncorrelated:
In fact, the original data used in this example and shown by figure 1 was generated by linearly combining two 1D Gaussian feature vectors as follows:
Since the features $x$ and $y$ are linear combinations of some unknown underlying components, directly eliminating either $x$ or $y$ as a feature would have removed some information from both underlying components. Instead, rotating the data by the eigenvectors of its covariance matrix allowed us to directly recover the independent components (up to a scaling factor). This can be seen as follows: the eigenvectors of the covariance matrix of the original data are (each column represents an eigenvector):
The first thing to notice is that in this case the eigenvector matrix is a rotation matrix, corresponding to a rotation of 45 degrees (cos(45°) ≈ 0.7071), which indeed is evident from figure 1. Secondly, treating the eigenvector matrix as a linear transformation matrix results in a new coordinate system, such that each new feature is expressed as a linear combination of the original features $x$ and $y$:
(2)
and
(3)
In other words, decorrelation of the feature space corresponds to the recovery of the unknown, uncorrelated components of the data (up to an unknown scaling factor if the transformation matrix was not orthogonal). Once these components have been recovered, it is easy to reduce the dimensionality of the feature space by simply eliminating one of them.
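The decorrelation described above can be sketched in a few lines. The mixing matrix below is a hypothetical choice made for this example (it is not the combination used to generate the figures); it produces correlated 2D data whose principal axes lie at 45 degrees. Rotating the centered data by the eigenvectors of its covariance matrix should yield a diagonal covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two uncorrelated 1D Gaussian components; columns are observations.
sources = rng.normal(size=(2, 5000))
# Hypothetical mixing matrix (an assumption for this sketch).
mixing = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
data = mixing @ sources

# Center the data and compute its covariance (rows are features).
data = data - data.mean(axis=1, keepdims=True)
cov = np.cov(data)

# eigh is suited to symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Rotate each data point so the eigenvectors become the reference axes.
rotated = eigvecs.T @ data
cov_rotated = np.cov(rotated)

print(np.round(cov, 2))
print(np.round(cov_rotated, 2))
```

The off-diagonal entries of the second matrix should vanish up to floating-point error, and its diagonal should hold the eigenvalues of the original covariance matrix.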
In the above example we started with a two-dimensional problem. If we would like to reduce the dimensionality, the question remains which of the two decorrelated features (and thus which underlying component) to eliminate. Although this choice could depend on many factors such as the separability of the data in case of classification problems, PCA simply assumes that the most interesting feature is the one with the largest
variance or spread. This assumption is based on an information theoretic point of view, since the dimension with the largest variance corresponds to the dimension with the largest entropy and thus encodes the most information. The smallest eigenvectors (those with the smallest eigenvalues) will often simply represent noise components, whereas the largest eigenvectors often correspond to the principal components that define the data.
Dimensionality reduction by means of PCA is then accomplished simply by projecting the data onto the largest eigenvectors of its covariance matrix. For the above example, the resulting 1D feature space is illustrated by figure 3:
Figure 3. PCA: 2D data projected onto its largest eigenvector.
Obviously, the above example easily generalizes to higher dimensional feature spaces. For instance, in the three-dimensional case, we can either project the data onto the plane defined by the two largest eigenvectors to obtain a 2D feature space, or we can project it onto the largest eigenvector to obtain a 1D feature space. This is illustrated by figure 4:
vector such that projecting the data onto this vector corresponds to a projection error that is lower than the projection error that would be obtained when projecting the data onto any other possible vector. The question is then how to find this optimal vector.
Consider the example shown by figure 5. Three different projection vectors are shown, together with the resulting 1D data. In the next paragraphs, we will discuss how to determine which projection vector minimizes the projection error. Before searching for a vector that minimizes the projection error, we have to define this error function.
Figure 5. Dimensionality reduction by projection onto a linear subspace.
A well-known method to fit a line to 2D data is least squares regression. Given the independent variable $x$ and the dependent variable $y$, the least squares regressor corresponds to the line $f(x) = ax + b$, such that the sum of the squared residual errors $\sum_{i}(f(x_i) - y_i)^2$ is
minimized. In other words, if $x$ is treated as the independent variable, then the obtained regressor $f(x)$ is a linear function that can predict the dependent variable $y$ such that the squared error is minimal. The resulting model $f(x)$ is illustrated by the blue line in figure 5, and the error that is minimized is illustrated in figure 6.
Figure 6. Linear regression where x is the independent variable and y is the dependent variable, corresponds to minimizing the vertical projection error.
However, in the context of feature extraction, one might wonder why we would define feature $x$ as the independent variable and feature $y$ as the dependent variable. In fact, we could easily define $y$ as the independent variable and find a linear function $f(y)$ that predicts the dependent variable $x$, such that $\sum_{i}(f(y_i) - x_i)^2$ is minimized. This corresponds to minimization of the horizontal projection error and results in a different linear model as shown by figure 7:
Figure 7. Linear regression where y is the independent variable and x is the dependent variable, corresponds to minimizing the horizontal projection error.
Clearly, the choice of independent and dependent variables changes the resulting model, making ordinary least squares regression an asymmetric regressor. The reason for this is that least squares regression assumes the independent variable to be noise-free, whereas the dependent variable is assumed to be noisy. However, in the case of classification, all features are usually noisy observations such that neither $x$ nor $y$ should be treated as independent. In fact, we would like to obtain a model that minimizes both the horizontal and the vertical projection error simultaneously. This corresponds to finding a model such that the orthogonal projection error is minimized as shown by figure 8.
Figure 8. Linear regression where both variables are independent corresponds to minimizing the orthogonal projection error.
The resulting regression is called Total Least Squares regression or orthogonal regression, and assumes that both variables are imperfect observations. An interesting observation is now that the obtained vector, representing the projection direction that minimizes the orthogonal projection error, corresponds to the largest principal component of the data:
Figure 9. The vector onto which the data can be projected with minimal orthogonal error corresponds to the largest eigenvector of the covariance matrix of the data.
In other words, if we want to reduce the dimensionality by projecting the original data onto a vector such that the squared projection error is minimized in all directions, we can simply project the data onto the largest eigenvectors. This is exactly what we called Principal Component Analysis in the previous section, where we showed that such projection also decorrelates the feature space.
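The difference between the three regressions can be illustrated on synthetic data. In the sketch below, the data generation (a line with slope 1 and equal noise on both coordinates) is an assumption made for the example. The slope that minimizes the vertical error and the slope that minimizes the horizontal error differ, while the direction of the largest eigenvector of the covariance matrix, which corresponds to the orthogonal fit, lies between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy observations of a line with slope 1;
# both coordinates are corrupted by noise.
t = rng.normal(size=2000)
x = t + 0.5 * rng.normal(size=t.size)
y = t + 0.5 * rng.normal(size=t.size)

xc, yc = x - x.mean(), y - y.mean()

# Ordinary least squares of y on x: minimizes the vertical error.
slope_y_on_x = np.sum(xc * yc) / np.sum(xc ** 2)
# Ordinary least squares of x on y: minimizes the horizontal error.
# Expressed as dy/dx so that all three slopes are comparable.
slope_x_on_y = np.sum(yc ** 2) / np.sum(xc * yc)

# Orthogonal fit: direction of the eigenvector with the largest eigenvalue.
cov = np.cov(np.vstack([xc, yc]))
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
principal = eigvecs[:, -1]
slope_tls = principal[1] / principal[0]

print(slope_y_on_x, slope_tls, slope_x_on_y)
```

With equal noise on both axes, the orthogonal slope should be close to the slope of the underlying line, whereas the two ordinary regressions are pulled in opposite directions.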
A practical PCA application: Eigenfaces
Although the above examples are limited to two or three dimensions for visualization purposes, dimensionality reduction usually becomes important when the number of features is not negligible compared to the number of training samples. As an example, suppose we would like to perform face recognition, i.e. determine the identity of the person depicted in an image, based on a training dataset of labeled face images. One approach might be to treat the brightness of each pixel of the image as a feature. If the input images are of size 32×32 pixels, this means that the feature vector contains 1024 feature values. Classifying a new face image can then be done by calculating the Euclidean distance between this 1024-dimensional vector and the feature vectors of the people in our training dataset. The smallest distance then tells us which person we are looking at.
However, operating in a 1024-dimensional space becomes problematic if we only have a few hundred training samples. Furthermore, Euclidean distances behave strangely in high dimensional spaces as discussed in an earlier article. Therefore, we could use PCA to reduce the dimensionality of the feature space by calculating the eigenvectors of the covariance matrix of the set of 1024-dimensional feature vectors, and then projecting each feature vector onto the largest eigenvectors.
Since an eigenvector of 2D data is 2-dimensional, and an eigenvector of 3D data is 3-dimensional, the eigenvectors of 1024-dimensional data are 1024-dimensional. In other words, we could reshape each of the 1024-dimensional eigenvectors to a 32×32 image for visualization purposes. Figure 10 shows the first four eigenvectors obtained by eigendecomposition of the Cambridge face dataset:
Figure 10. The four largest eigenvectors, reshaped to images, resulting in so-called EigenFaces. (source: https://nl.wikipedia.org/wiki/Eigenface)
Each 1024-dimensional feature vector (and thus each face) can now be projected onto the N largest eigenvectors, and can be represented as a linear combination of these eigenfaces. The weights of these linear combinations determine the identity of the person. Since the largest eigenvectors represent the largest variance in the data, these eigenfaces describe the most informative image regions (eyes, nose, mouth, etc.). By only considering the first N (e.g. N=70) eigenvectors, the dimensionality of the feature space is greatly reduced.
The remaining question is now how many eigenfaces should be used, or in the general case, how many eigenvectors should be kept. Removing too many eigenvectors might remove important information from the feature space, whereas eliminating too few eigenvectors leaves us with the curse of dimensionality. Regrettably there is no straight answer to this problem. Although cross-validation techniques can be used to obtain an estimate of this hyperparameter, choosing the optimal number of dimensions remains a problem that is mostly solved in an empirical (an academic term that means not much more than ‘trial-and-error’) manner. Note that it is often useful to check how much (as a percentage) of the variance of the original data is kept while eliminating eigenvectors. This is done by dividing the sum of the kept eigenvalues by the sum of all eigenvalues.
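The retained-variance check described above can be sketched as follows. The function name `components_for_variance`, the example eigenvalues, and the 90% target are all hypothetical choices made for illustration.

```python
import numpy as np

def components_for_variance(eigenvalues, target=0.9):
    """Return the smallest number of eigenvectors whose eigenvalues
    account for at least `target` of the total variance, together with
    the cumulative retained-variance ratios."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    cumulative = np.cumsum(vals)
    ratio = cumulative / cumulative[-1]
    # First position where the cumulative ratio reaches the target.
    k = int(np.searchsorted(ratio, target)) + 1
    return min(k, vals.size), ratio

# Made-up eigenvalues for the example.
k, ratio = components_for_variance([5.0, 3.0, 1.5, 0.4, 0.1], target=0.9)
print(k, np.round(ratio, 3))
```

Here the two largest eigenvalues account for 80% of the total, so a third eigenvector is needed to reach the 90% target.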
The PCA recipe
Based on the previous sections, we can now list the simple recipe used to apply PCA for feature extraction:
1) Center the data
In an earlier article, we showed that the covariance matrix can be written as a sequence of linear operations (scaling and rotations). The eigendecomposition extracts these transformation matrices: the eigenvectors represent the rotation matrix, while the eigenvalues represent the scaling factors. However, the covariance matrix does not contain any information related to the translation of the data. Indeed, to represent translation, an affine transformation would be needed instead of a linear transformation.
Therefore, before applying PCA to rotate the data in order to obtain uncorrelated axes, any existing shift needs to be countered by subtracting the mean of the data from each data point. This simply corresponds to centering the data such that its average becomes zero.
2) Normalize the data
The eigenvectors of the covariance matrix point in the direction of the largest variance of the data. However, variance is an absolute number, not a relative one. This means that the variance of data measured in centimeters (or inches) will be much larger than the variance of the same data when measured in meters (or feet). Consider the example where one feature represents the length of an object in meters, while the second feature represents the width of the object in centimeters. The largest variance, and thus the largest eigenvector, will implicitly be defined by the second feature if the data is not normalized, simply because its numerical values are larger.
To avoid this scale-dependent nature of PCA, it is useful to normalize the data by dividing each feature by its standard deviation. This is especially important if different features are measured in different units.
3) Calculate the eigendecomposition
Since the data will be projected onto the largest eigenvectors to reduce the dimensionality, the eigendecomposition needs to be obtained. One of the most widely used methods to efficiently calculate the eigendecomposition is Singular Value Decomposition (SVD).
4) Project the data
To reduce the dimensionality, the data is simply projected onto the largest eigenvectors. Let $V$ be the matrix whose columns contain the largest eigenvectors and let $D$ be the original data whose columns contain the different observations. Then the projected data $D'$ is obtained as $D' = V^{\top} D$. We can either choose the number of remaining dimensions, i.e. the columns of $V$, directly, or we can define the amount of variance of the original data that needs to be kept while eliminating eigenvectors. If only $N$ eigenvectors are kept, and $\lambda_1, \ldots, \lambda_N$ represent the corresponding eigenvalues, then the amount of variance that remains after projecting the original $d$-dimensional data can be calculated as:
$\frac{\sum_{i=1}^{N}\lambda_i}{\sum_{j=1}^{d}\lambda_j}$ (4)
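The four steps of the recipe can be combined into one short function. This is a minimal sketch rather than a reference implementation: the function name `pca_project`, its parameters, and the synthetic test data are assumptions made for the example. It follows the convention used above, in which the columns of the data matrix are observations.

```python
import numpy as np

def pca_project(data, n_components, normalize=True):
    """Project data of shape (d, n), columns being observations,
    onto its n_components largest eigenvectors."""
    # 1) Center the data so that each feature has zero mean.
    centered = data - data.mean(axis=1, keepdims=True)
    # 2) Optionally divide each feature by its standard deviation
    #    (assumes no feature is constant).
    if normalize:
        centered = centered / centered.std(axis=1, ddof=1, keepdims=True)
    # 3) SVD of the centered data: the left singular vectors are the
    #    eigenvectors of the covariance matrix, and the squared singular
    #    values divided by (n - 1) are its eigenvalues, largest first.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s ** 2 / (data.shape[1] - 1)
    components = u[:, :n_components]
    # 4) Project the data onto the kept eigenvectors.
    projected = components.T @ centered
    retained = eigenvalues[:n_components].sum() / eigenvalues.sum()
    return projected, components, retained

# Synthetic 3D data with very different spreads per feature (made up).
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 500)) * np.array([[5.0], [2.0], [0.5]])
projected, components, retained = pca_project(data, n_components=2, normalize=False)
print(projected.shape, round(float(retained), 3))
```

Normalization is switched off in the usage example so that the difference in spread between the features remains visible in the retained-variance ratio; with `normalize=True` every feature would contribute equally.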
PCA pitfalls
In the above discussion, several assumptions have been made. In the first section, we discussed how PCA decorrelates the data. In fact, we started the discussion by expressing our desire to recover the unknown, underlying independent components of the observed features. We then assumed that our data was normally distributed, such that statistical independence simply corresponds to the lack of a linear correlation. Indeed, PCA allows us to decorrelate the data, thereby recovering the independent components in case of Gaussianity. However, it is important to note that decorrelation only corresponds to statistical independence in the Gaussian case. Consider the data obtained by sampling half a period of $y = \sin(x)$:
Figure 11. Uncorrelated data is only statistically independent if normally distributed. In this example a clear non-linear dependency still exists: y=sin(x).
Although the above data is clearly uncorrelated (on average, the y-value increases as much as it decreases when the x-value goes up) and therefore corresponds to a diagonal covariance matrix, there still is a clear non-linear dependency between both variables.
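This can be checked numerically with a short Python sketch. The sampling grid of 1001 points and the helper name `covariance` are choices made here for illustration. The covariance between x and y is zero up to rounding error, while the covariance between y and a non-linear function of x is clearly not.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Sample half a period of y = sin(x), i.e. x in [0, pi].
n = 1001
xs = [math.pi * i / (n - 1) for i in range(n)]
ys = [math.sin(x) for x in xs]

# Linear relation between x and y: vanishes by symmetry around x = pi / 2.
cov_xy = covariance(xs, ys)

# Relation between y and the squared distance of x to pi / 2: clearly non-zero,
# which reveals the dependency that the covariance matrix does not capture.
squared_distance = [(x - math.pi / 2) ** 2 for x in xs]
cov_nonlinear = covariance(squared_distance, ys)

print(cov_xy, cov_nonlinear)
```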
In general, PCA only decorrelates the data but does not remove statistical dependencies. If the underlying components are known to be non-Gaussian, techniques such as ICA could be more interesting. On the other hand, if non-linearities clearly exist, dimensionality reduction techniques such as non-linear PCA can be used. However, keep in mind that these methods are prone to overfitting themselves, since more parameters are to be estimated based on the same amount of training data.
A second assumption that was made in this article is that the most discriminative information is captured by the largest variance in the feature space. Since the direction of the largest variance encodes the most information, this is likely to be true. However, there are cases where the discriminative information actually resides in the directions of the smallest variance, such that PCA could greatly hurt classification performance. As an example, consider the two cases of figure 12, where we reduce the 2D feature space to a 1D representation:
Figure 12. In the first case, PCA would hurt classification performance because the data becomes linearly inseparable. This happens when the most discriminative information resides in the eigenvectors with the smallest eigenvalues.
If the most discriminative information is contained in the eigenvectors with the smallest eigenvalues, applying PCA might actually worsen the Curse of Dimensionality because now a more complicated classification model (e.g. a non-linear classifier) is needed to classify the lower dimensional problem. In this case, other dimensionality reduction methods might be of interest, such as Linear Discriminant Analysis (LDA), which tries to find the projection vector that optimally separates the two classes.
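The failure case can be reproduced with a small synthetic example in Python. The two classes below are hypothetical and were constructed so that they differ only along the direction of smallest variance; the closed-form angle formula is used here as a simple stand-in for a full eigendecomposition of the 2x2 covariance matrix.

```python
import math

def principal_axes(points):
    """Return (largest, smallest) unit eigenvectors of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Orientation of the largest-variance axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    largest = (math.cos(theta), math.sin(theta))
    smallest = (-math.sin(theta), math.cos(theta))
    return largest, smallest

def project(points, axis):
    """1D coordinates of the points along a unit axis (no centering, for simplicity)."""
    return [p[0] * axis[0] + p[1] * axis[1] for p in points]

# Two hypothetical classes: wide spread along x, separated only along y.
class_a = [(float(x), 0.5) for x in range(-5, 6)]
class_b = [(float(x), -0.5) for x in range(-5, 6)]

largest, smallest = principal_axes(class_a + class_b)

# Along the largest-variance axis both classes occupy the same interval.
print(sorted(project(class_a, largest)), sorted(project(class_b, largest)))
# Along the smallest-variance axis they are perfectly separated.
print(min(project(class_a, smallest)), max(project(class_b, smallest)))
```

Keeping only the largest principal component discards exactly the coordinate that distinguishes the classes, which is the situation a supervised method such as LDA is designed to avoid.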
Source Code
The following code snippet shows how to perform principal component analysis for dimensionality
reduction in Matlab:
Matlab source code: http://www.visiondummy.com/wp-content/uploads/2014/05/pca.m
Conclusion
In this article, we discussed the advantages of PCA for feature extraction and dimensionality reduction from
two different points of view. The first point of view explained how PCA allows us to decorrelate the feature
space, whereas the second point of view showed that PCA actually corresponds to orthogonal regression.
Furthermore, we briefly introduced Eigenfaces as a well known example of PCA based feature extraction, and we covered some of the most important disadvantages of Principal Component Analysis.
What are eigenvectors and eigenvalues?
Contents
1 Introduction
2 Calculating the eigenvalues
3 Calculating the first eigenvector
4 Calculating the second eigenvector
5 Conclusion
Introduction
Eigenvectors and eigenvalues have many important applications in computer vision and machine learning in general. Well known examples are PCA (Principal Component Analysis) for dimensionality reduction or EigenFaces for face recognition. An interesting use of eigenvectors and eigenvalues is also illustrated in my post about error ellipses. Furthermore, eigendecomposition forms the base of the geometric interpretation of covariance matrices, discussed in a more recent post. In this article, I will provide a gentle introduction to this mathematical concept, and will show how to manually obtain the eigendecomposition of a 2D square matrix.
An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. Consider the image below in which three vectors are shown. The green square is only drawn to illustrate the linear transformation that is applied to each of these three vectors.
Eigenvectors (red) do not change direction when a linear transformation (e.g. scaling) is applied to them. Other vectors (yellow) do.
The transformation in this case is a simple scaling with factor 2 in the horizontal direction and factor 0.5 in the vertical direction, such that the transformation matrix is defined as:
$$A = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}.$$
A vector $\vec{v} = (x, y)$ is then scaled by applying this transformation as $\vec{v}' = A\vec{v}$. The above figure shows that the direction of some vectors (shown in red) is not affected by this linear transformation. These vectors are called eigenvectors of the transformation and, together with their eigenvalues, characterize the square matrix $A$. This close relation between a matrix and its eigenvectors is exactly the reason that those vectors are called ‘eigenvectors’ (‘eigen’ means ‘own’ or ‘characteristic’ in German).
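The scaling example is easy to verify in Python. The sketch below applies the scaling matrix with factors 2 and 0.5 to three vectors and tests whether each result is still parallel to the original vector; the helper names `transform` and `is_parallel` are illustrative choices.

```python
def transform(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def is_parallel(u, w, tol=1e-12):
    """True when the 2D vectors u and w lie on the same line (zero cross product)."""
    return abs(u[0] * w[1] - u[1] * w[0]) < tol

# Scaling with factor 2 horizontally and factor 0.5 vertically.
A = [[2.0, 0.0],
     [0.0, 0.5]]

for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    w = transform(A, v)
    print(v, "->", w, "direction kept:", is_parallel(v, w))
```

The horizontal and vertical vectors keep their direction and are therefore eigenvectors, with eigenvalues 2 and 0.5 respectively, whereas the diagonal vector is rotated towards the horizontal axis.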
In general, the eigenvector $\vec{v}$ of a matrix $A$ is the vector for which the following holds:
$$A\vec{v} = \lambda\vec{v} \tag{1}$$
where $\lambda$ is a scalar value called the ‘eigenvalue’. This means that the linear transformation $A$ on vector $\vec{v}$ is completely defined by $\lambda$.
We can rewrite equation (1) as follows:
$$A\vec{v} - \lambda\vec{v} = 0 \;\Rightarrow\; (A - \lambda I)\vec{v} = 0 \tag{2}$$
where $I$ is the identity matrix of the same dimensions as $A$.
However, assuming that $\vec{v}$ is not the null-vector, equation (2) can only hold if $(A - \lambda I)$ is not invertible. If a square matrix is not invertible, its determinant must equal zero. Therefore, to find the eigenvalues of $A$, and from them the eigenvectors, we simply have to solve the following equation:
$$\det(A - \lambda I) = 0 \tag{3}$$
In the following sections we will determine the eigenvectors and eigenvalues of a matrix $A$ by solving equation (3). Matrix $A$ in this example is defined by:
$$A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \tag{4}$$
Calculating the eigenvalues
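To determine the eigenvalues for this example, we substitute $A$ in equation (3) by equation (4) and obtain:
$$\det\begin{bmatrix} 2-\lambda & 3 \\ 2 & 1-\lambda \end{bmatrix} = 0 \tag{5}$$
Calculating the determinant gives:
$$(2-\lambda)(1-\lambda) - 6 = 0 \;\Rightarrow\; \lambda^2 - 3\lambda - 4 = 0 \tag{6}$$
To solve this quadratic equation in $\lambda$, we find the discriminant:
$$D = b^2 - 4ac = (-3)^2 - 4 \cdot 1 \cdot (-4) = 9 + 16 = 25$$
Since the discriminant is strictly positive, this means that two different values for $\lambda$ exist:
$$\lambda_1 = \frac{-b - \sqrt{D}}{2a} = \frac{3 - 5}{2} = -1, \qquad \lambda_2 = \frac{-b + \sqrt{D}}{2a} = \frac{3 + 5}{2} = 4 \tag{7}$$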
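The same calculation can be written as a short Python function. This is a sketch that takes the example matrix to be [[2, 3], [2, 1]] and assumes real eigenvalues; the function name `eigenvalues_2x2` is an illustrative choice.

```python
import math

def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]].

    Solves the characteristic equation
    lambda^2 - (a + d) * lambda + (a * d - b * c) = 0.
    """
    trace = a + d
    determinant = a * d - b * c
    discriminant = trace * trace - 4.0 * determinant
    if discriminant < 0:
        raise ValueError("the eigenvalues are complex")
    root = math.sqrt(discriminant)
    return (trace - root) / 2.0, (trace + root) / 2.0

lambda_1, lambda_2 = eigenvalues_2x2(2, 3, 2, 1)
print(lambda_1, lambda_2)
```

A useful sanity check is that the eigenvalues sum to the trace of the matrix and multiply to its determinant.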
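We have now determined the two eigenvalues $\lambda_1$ and $\lambda_2$. Note that a square matrix of size $N \times N$ always has exactly $N$ eigenvalues (counted with multiplicity, and possibly complex), each with a corresponding eigenvector. The eigenvalue specifies the factor by which its eigenvector is scaled by the transformation.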
Calculating the first eigenvector
We can now determine the eigenvectors by plugging the eigenvalues from equation (7) into equation (1) that originally defined the problem. The eigenvectors are then found by solving this system of equations.
We first do this for eigenvalue $\lambda_1$, in order to find the corresponding first eigenvector:
$$\begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix} = -1 \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}$$
Since this is simply the matrix notation for a system of equations, we can write it in its equivalent form:
$$\begin{cases} 2x_{11} + 3x_{12} = -x_{11} \\ 2x_{11} + x_{12} = -x_{12} \end{cases} \tag{8}$$
and solve the first equation as a function of $x_{12}$, resulting in:
$$x_{11} = -x_{12} \tag{9}$$
Since an eigenvector simply represents an orientation (the corresponding eigenvalue represents the scaling factor), all scalar multiples of the eigenvector are vectors that are parallel to this eigenvector, and are therefore equivalent (if we normalized the vectors, they would all be equal up to sign). Thus, instead of further solving the above system of equations, we can freely choose a real, non-zero value for either $x_{11}$ or $x_{12}$, and determine the other one by using equation (9).
For this example, we arbitrarily choose $x_{12} = 1$, such that $x_{11} = -1$. Therefore, the eigenvector that corresponds to eigenvalue $\lambda_1 = -1$ is
$$\vec{v}_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \tag{10}$$
Calculating the second eigenvector
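Calculations for the second eigenvector are similar to those needed for the first eigenvector. We now substitute eigenvalue $\lambda_2 = 4$ into equation (1), yielding:
$$\begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix} = 4 \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix} \tag{11}$$
Written as a system of equations, this is equivalent to:
$$\begin{cases} 2x_{21} + 3x_{22} = 4x_{21} \\ 2x_{21} + x_{22} = 4x_{22} \end{cases} \tag{12}$$
Solving the first equation as a function of $x_{21}$ results in:
$$x_{22} = \frac{2}{3} x_{21} \tag{13}$$
We then arbitrarily choose $x_{21} = 3$, and find $x_{22} = 2$. Therefore, the eigenvector that corresponds to eigenvalue $\lambda_2 = 4$ is
$$\vec{v}_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \tag{14}$$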
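Both eigenvectors can be checked numerically. The Python sketch below takes the example matrix to be [[2, 3], [2, 1]] with eigenvalues -1 and 4, derives one eigenvector per eigenvalue from the first row of the system, and verifies that multiplying by the matrix only rescales it. The function name and the choice of scale are illustrative; any non-zero multiple of the returned vector is an equally valid eigenvector.

```python
def eigenvector_2x2(a, b, c, d, eigenvalue):
    """One eigenvector of [[a, b], [c, d]] for the given eigenvalue.

    The first row of (A - lambda * I) v = 0 reads (a - lambda) * x + b * y = 0,
    which is satisfied by (x, y) = (b, lambda - a). If that row is all zeros,
    the second row c * x + (d - lambda) * y = 0 is used instead.
    Assumes A - lambda * I is not the zero matrix.
    """
    x, y = b, eigenvalue - a
    if x == 0 and y == 0:
        x, y = eigenvalue - d, c
    return x, y

def transform(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

A = [[2, 3],
     [2, 1]]

v1 = eigenvector_2x2(2, 3, 2, 1, -1)   # a multiple of (-1, 1)
v2 = eigenvector_2x2(2, 3, 2, 1, 4)    # equal to (3, 2)

# A v must equal lambda v for each pair.
print(transform(A, v1), [-1 * x for x in v1])
print(transform(A, v2), [4 * x for x in v2])
```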
Conclusion
In this article we reviewed the theoretical concepts of eigenvectors and eigenvalues. These concepts are of great importance in many techniques used in computer vision and machine learning, such as dimensionality reduction by means of PCA, or face recognition by means of EigenFaces.