
8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

Date post: 29-Nov-2014
Upload: untellectualism
Transcript
Page 1: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008


Multivariate Samples

Recall some very basic concepts of univariate and bivariate statistics

Describe Multivariate Samples

Analyze multivariate samples in a geometrical perspective

Describe distances in the Euclidean Space

Page 2

The data we will consider

Example 1. Innovation and Research in Europe (Source: Eurostat)

Geo: Country code

Country: Country name

Region: European region

Educ_Exp: Spending on Human Resources (total public expenditure on education) - % of GDP

GERD: Gross domestic expenditure on R&D (GERD) - as a % of GDP

GERD_industry: GERD - industry - % of GERD financed by industry

GERD_govern: GERD - government - % of GERD financed by government

GERD_abroad: GERD - abroad - % of GERD financed from abroad

Internet_Acc: Level of Internet access - % of households who have Internet access at home

ST_grad: Science and technology - Tertiary graduates in S&T per 1000 persons aged 20-29

ST_grad_f: Female tertiary graduates in S&T per 1000 females aged 20-29

ST_grad_m: Male tertiary graduates in S&T per 1000 males aged 20-29

EPO: Number of patent applications to the European Patent Office per million inhabitants

USTPO: Number of patents granted by the US Patent and Trademark Office per million inhabitants

IT_Expenditure: Expenditure on Information Technology as a % of GDP

Telec_Expenditure: Expenditure on Telecommunications as a % of GDP

Y_Educ_Lev: Youth education attainment level - total - % of the population aged 20-24 who completed at least upper secondary education

Y_Educ_Lev_f: % of females aged 20-24 who completed at least upper secondary education

Y_Educ__Lev_m: % of males aged 20-24 who completed at least upper secondary education

E_gov_avail: E-government on-line availability - Online availability of 20 basic public services

HT_Exports: Exports of high technology products as a share of total exports

Page 3

Some basic concepts of Univariate and Bivariate statistics

Page 4

Back to basics… Considering one variable

Let us consider one variable of interest, say EPO.

country region Internet_Acc EPO

Romania Eastern 6.00 1.31

Czech Republic Eastern 19.00 12.04

Lithuania Northern 12.00 2.78

Ireland Northern 40.00 79.87

Norway Northern 60.00 135.77

UK Northern 56.00 124.19

Finland Northern 51.00 309.09

Sweden Northern 73.00 293.32

Greece Southern 17.00 9.87

Italy Southern 34.00 84.14

Spain Southern 34.00 30.64

Netherlands Western 67.00 246.15

Belgium Western 50.00 141.80

Germany Western 60.00 299.99

France Western 34.00 144.52

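In statistics, a commonly used measure of position is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations:

$$\text{Mean of EPO} = \overline{epo} = \frac{1}{15}\sum_{i=1}^{15} epo_i = 127.6987$$

[Plot: observed EPO values on a 0–320 scale; Netherlands, Spain and the mean (127.6987) are marked.]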

Page 5

Back to basics…. Considering one variable

The mean can be used to make a “prediction” about EPO for a generic country without any further information.

To evaluate the reliability of the mean as a synthesis of the observed data, we can consider for each observed value the error incurred when substituting it with the sample mean.

[Plot: observed EPO values with the mean (127.6987) marked, showing the errors incurred when substituting the mean for the values observed for Netherlands and Spain, respectively.]

The TOTAL SUM OF SQUARES is the sum of the squared errors:

$$\text{TOTAL SS}_{EPO} = \sum_{i=1}^{15} (epo_i - \overline{epo})^2$$

Variable of interest: EPO

Page 6

Back to basics…. Considering one variable

A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance

$$\text{Var of EPO} = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})^2 = \frac{1}{14}\sum_{i=1}^{15} (epo_i - 127.6987)^2$$

This is the average of the squared errors we incur when substituting the observed values with the sample mean. It is obtained by dividing the Total SS by the number of observations (minus 1)

The variance of EPO turns out to be 12646.5814. Hence the error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation:

$$\text{Std. Dev. of EPO} = \sqrt{\text{Var of EPO}} = \sqrt{12646.5814} = 112.457$$

Variable of interest: EPO
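The mean, total sum of squares, variance and standard deviation described above can be checked with a few lines of Python. This is only a sketch using the 15 EPO values from the table; the variable names are illustrative and not part of the original slides.

```python
import math

# EPO values for the 15 countries listed in the table above
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n                            # arithmetic (sample) mean
total_ss = sum((x - mean) ** 2 for x in epo)   # total sum of squares
variance = total_ss / (n - 1)                  # sample variance (divisor n - 1)
std_dev = math.sqrt(variance)                  # standard deviation

print(f"mean={mean:.4f}  total SS={total_ss:.4f}  "
      f"variance={variance:.4f}  std dev={std_dev:.4f}")
```

The printed values should agree, up to rounding, with the figures reported in the slides (127.6987, 177052.1395, 12646.5814).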

Page 7

Back to basics…. Considering one variable

In statistics we are mainly concerned with the explanation of variance: we want to explain why a phenomenon varies, and we look for predictive tools characterized by low prediction errors.

So the question now is: Can we do better than the mean?

That is, can we use external information (other vars) that is related to EPO, and hence useful to predict the values of EPO with a lower error?

In the following we will consider two supporting variables having different characteristics:

The Region (a categorical variable)

Internet_Access (a numerical variable)

and we will show how it is possible to evaluate the extent to which one external variable provides information about the variable of interest

Variable of interest: EPO

Page 8

Back to basics… Considering one variable

If we consider the region, can our prediction of EPO be better?

country region EPO

Romania Eastern 1.31

Czech Republic Eastern 12.04

Lithuania Northern 2.78

Ireland Northern 79.87

Norway Northern 135.77

UK Northern 124.19

Finland Northern 309.09

Sweden Northern 293.32

Greece Southern 9.87

Italy Southern 84.14

Spain Southern 30.64

Netherlands Western 246.15

Belgium Western 141.80

Germany Western 299.99

France Western 144.52

[Plot: values of EPO observed within the regions, on a 0–320 scale, with the general mean (127.6987) marked; Netherlands and Spain are labelled.]

We can use the conditional means (the mean of EPO within each region) rather than the general one.

This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it is never higher).

Page 9

Back to basics… Considering one variable

Consider the region to improve prediction on EPO

Use the conditional means

[Plot: EPO values by region on a 0–320 scale; Netherlands and Spain are labelled.]

To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the proper conditional mean.

In the plot: errors for Netherlands and Spain

The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO

Page 10

Back to basics…. Considering one variable

country region EPO General mean Squared errors Conditional means Squared errors

Romania Eastern 1.31 127.6987 15974.1035 6.675 28.7832

Czech Republic Eastern 12.04 127.6987 13376.9349 6.675 28.7832

Lithuania Northern 2.78 127.6987 15604.6816 157.5033 23939.3

Ireland Northern 79.87 127.6987 2287.5845 157.5033 6026.929

Norway Northern 135.77 127.6987 65.1459 157.5033 472.3363

UK Northern 124.19 127.6987 12.311 157.5033 1109.776

Finland Northern 309.09 127.6987 32902.8037 157.5033 22978.53

Sweden Northern 293.32 127.6987 27430.415 157.5033 18446.18

Greece Southern 9.87 127.6987 13883.6025 41.55 1003.622

Italy Southern 84.14 127.6987 1897.3603 41.55 1813.908

Spain Southern 30.64 127.6987 9420.3912 41.55 119.0281

Netherlands Western 246.15 127.6987 14030.7105 208.115 1446.661

Belgium Western 141.8 127.6987 198.8467 208.115 4397.679

Germany Western 299.99 127.6987 29684.2921 208.115 8441.016

France Western 144.52 127.6987 282.9561 208.115 4044.324

TOTAL SS_EPO = 177052.1395; WITHIN SS_EPO|REGION = 94296.85

If we use the region, our improvement as compared to the general mean is

$$R^2_{EPO|REG} = 1 - \frac{\text{WITHIN SS}_{EPO|REG}}{\text{TOTAL SS}_{EPO}} = 1 - \frac{94296.85}{177052.1395} = 0.467$$

i.e., the % of variance of EPO accounted for by Region.

The R² ranges from 0 to 1. It measures the ability of the categorical variable to predict the numerical one.

Compare general mean / conditional means as predictors of EPO
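The comparison between the general mean and the conditional means can be reproduced as follows. This is a minimal sketch based on the region and EPO columns of the table; names such as `cond_mean` and `within_ss` are illustrative.

```python
from collections import defaultdict

# (region, EPO) pairs taken from the table above
data = [
    ("Eastern", 1.31), ("Eastern", 12.04),
    ("Northern", 2.78), ("Northern", 79.87), ("Northern", 135.77),
    ("Northern", 124.19), ("Northern", 309.09), ("Northern", 293.32),
    ("Southern", 9.87), ("Southern", 84.14), ("Southern", 30.64),
    ("Western", 246.15), ("Western", 141.80), ("Western", 299.99),
    ("Western", 144.52),
]

# Group the EPO values by region and compute the conditional means
groups = defaultdict(list)
for region, value in data:
    groups[region].append(value)
cond_mean = {region: sum(v) / len(v) for region, v in groups.items()}

values = [value for _, value in data]
general_mean = sum(values) / len(values)

# Squared errors around the general mean and around the conditional means
total_ss = sum((value - general_mean) ** 2 for value in values)
within_ss = sum((value - cond_mean[region]) ** 2 for region, value in data)

r2 = 1 - within_ss / total_ss   # share of the variance of EPO explained by region
print(cond_mean)
print(f"total SS={total_ss:.2f}  within SS={within_ss:.2f}  R2={r2:.4f}")
```

Because each conditional mean minimizes the squared errors within its own group, `within_ss` can never exceed `total_ss`, which is why R² stays between 0 and 1.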

Page 11

Back to basics… Considering one variable

If we consider Internet_Access, can our prediction of EPO be better?

country Internet_Acc EPO

Romania 6.00 1.31

Czech Republic 19.00 12.04

Lithuania 12.00 2.78

Ireland 40.00 79.87

Norway 60.00 135.77

UK 56.00 124.19

Finland 51.00 309.09

Sweden 73.00 293.32

Greece 17.00 9.87

Italy 34.00 84.14

Spain 34.00 30.64

Netherlands 67.00 246.15

Belgium 50.00 141.80

Germany 60.00 299.99

France 34.00 144.52

When considering numerical variables, we are interested in evaluating the existence of a linear association between them.

[Scatter plot: EPO (vertical axis, 0–350) against Internet_Access (horizontal axis, 0–80).]

To evaluate whether a linear relationship exists and to determine its direction, we refer to the sample covariance (an absolute measure of linear association):

$$\text{Cov}(EPO, Int\_Acc) = \frac{1}{14}\sum_{i=1}^{15} (epo_i - \overline{epo})(int\_acc_i - \overline{int\_acc})$$

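Page 12

Back to basics… Considering one variable

If we consider Internet_Access, can our prediction of EPO be better?

The covariance between the two variables is:

Cov(EPO, Int_Acc) = 1868.5152

(Note: this value corresponds to dividing the sum of cross-products by n = 15; with the n − 1 = 14 divisor of the sample covariance, the value is 2001.9805. The correlation coefficient is unaffected, provided the same divisor is used for the variances.)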

This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). However, the value of the covariance depends on the units of measurement of the variables considered. A relative measure of linear association is the correlation coefficient:

$$\text{Corr}(EPO, Int\_Acc) = \frac{\text{Cov}(EPO, Int\_Acc)}{\sqrt{\text{Var}(EPO)\,\text{Var}(Int\_Acc)}}$$

The correlation coefficient ranges from –1 to +1. Values close to +1 indicate strong direct linear association, values close to –1 denote strong inverse linear association. Values close to zero indicate no linear relationship.

Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association)
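As a check on the covariance and correlation, the sketch below computes both from the Internet_Acc and EPO columns. It evaluates the covariance with both the n − 1 and the n divisor, since the two give different values while the correlation is the same either way; all names are illustrative.

```python
import math

# Internet_Acc and EPO for the 15 countries in the table
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(int_acc) / n
mean_y = sum(epo) / n

# Sums of squares and cross-products of the deviations from the means
ss_x = sum((x - mean_x) ** 2 for x in int_acc)
ss_y = sum((y - mean_y) ** 2 for y in epo)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(int_acc, epo))

cov_n_minus_1 = s_xy / (n - 1)   # sample covariance, divisor n - 1
cov_n = s_xy / n                 # same cross-products, divisor n
corr = s_xy / math.sqrt(ss_x * ss_y)   # the divisor cancels out here

print(f"cov (n-1)={cov_n_minus_1:.4f}  cov (n)={cov_n:.4f}  corr={corr:.4f}")
```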

Page 13

Back to basics… Considering one variable

If we consider Internet_Access, can our prediction of EPO be better?

[Scatter plot: EPO (vertical axis, –50 to 350) against Internet_Access (horizontal axis, 0–80), with the fitted line.]

The high value of the correlation tells us that observations tend to cluster around a line having a positive slope. This line, highlighted in the scatter plot, is called the regression line.

Its analytical expression can be easily determined:

EPO = –60.018 + 4.5934*Int_Acc

Page 14

Back to basics… Considering one variable

[Scatter plot: EPO against Internet_Access with the regression line EPO = –60.018 + 4.5934*Int_Acc; the error for Spain is marked.]

For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line.

In the plot, the error is highlighted for Spain.

The MODEL SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO (this quantity is more commonly called the residual, or error, sum of squares).

Consider Internet_Access to improve prediction on EPO

Use the regression line

Page 15

Back to basics… Considering one variable

Prediction using the line = 4.5934*Int_Acc – 60.018

country Int_Acc EPO Gen mean Squared errors Prediction Squared errors

Romania 6.00 1.31 127.6987 15974.1035 -32.4576 1140.251

Czech Republic 19.00 12.04 127.6987 13376.9349 27.2566 231.5449

Lithuania 12.00 2.78 127.6987 15604.6816 -4.8972 58.9394

Ireland 40.00 79.87 127.6987 2287.5845 123.718 1922.647

Norway 60.00 135.77 127.6987 65.1459 215.586 6370.594

UK 56.00 124.19 127.6987 12.311 197.2124 5332.271

Finland 51.00 309.09 127.6987 32902.8037 174.2454 18183.07

Sweden 73.00 293.32 127.6987 27430.415 275.3002 324.7132

Greece 17.00 9.87 127.6987 13883.6025 18.0698 67.2367

Italy 34.00 84.14 127.6987 1897.3603 96.1576 144.4227

Spain 34.00 30.64 127.6987 9420.3912 96.1576 4292.556

Netherlands 67.00 246.15 127.6987 14030.7105 247.7398 2.5275

Belgium 50.00 141.8 127.6987 198.8467 169.652 775.7339

Germany 60.00 299.99 127.6987 29684.2921 215.586 7124.035

France 34.00 144.52 127.6987 282.9561 96.1576 2338.922

TOTAL SS_EPO = 177052.1395; MODEL SS_EPO|Int_Acc = 48309.46

Notice that we have a considerable decrease of the prediction errors.

Compare general mean / regression line as predictors of EPO

Page 16

Back to basics… Considering one variable

If we consider Internet_Access, can our prediction of EPO be better?

If we use the line (function of Int_Acc), our improvement as compared to the general mean is

$$R^2_{EPO|Int\_Acc} = 1 - \frac{\text{MODEL SS}_{EPO|Int\_Acc}}{\text{TOTAL SS}_{EPO}} = 1 - \frac{48309.46}{177052.1395} = 0.7271$$

i.e., the % of variance of EPO accounted for by Int_Acc.

The R² index ranges from 0 to 1 and it measures the ability of the numerical variable to predict the other one.

It can be shown that the index coincides with the squared correlation coefficient.

Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable which can be explained by the other (numerical) variable.
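The regression line and the resulting R² can be reproduced with the usual least-squares formulas (slope = sum of cross-products / sum of squares of the predictor). The sketch below assumes these standard formulas, which the slides do not spell out, and uses illustrative names.

```python
import math

int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(int_acc) / n
mean_y = sum(epo) / n
ss_x = sum((x - mean_x) ** 2 for x in int_acc)
ss_y = sum((y - mean_y) ** 2 for y in epo)          # TOTAL SS of EPO
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(int_acc, epo))

# Least-squares estimates of the line EPO = intercept + slope * Int_Acc
slope = s_xy / ss_x
intercept = mean_y - slope * mean_x

# Squared errors incurred when predicting EPO with the line
# (called MODEL SS in the slides)
predictions = [intercept + slope * x for x in int_acc]
model_ss = sum((y - p) ** 2 for y, p in zip(epo, predictions))

r2 = 1 - model_ss / ss_y
corr = s_xy / math.sqrt(ss_x * ss_y)

print(f"EPO = {intercept:.3f} + {slope:.4f} * Int_Acc")
print(f"MODEL SS={model_ss:.2f}  R2={r2:.4f}  corr^2={corr ** 2:.4f}")
```

The last line illustrates the statement that R² coincides with the squared correlation coefficient.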

Page 17

Data Matrices (Numerical variables only)

Page 18

Data matrices

Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations.

country region GERD GERD_industry GERD_govern Internet_Acc ST_grad EPO E_gov_avail

Romania Eastern 0.39 47.60 43.00 6.00 5.80 1.31 25.00

Czech Republic Eastern 1.20 52.50 43.60 19.00 6.00 12.04 30.00

Lithuania Northern 0.67 37.10 56.30 12.00 14.60 2.78 40.00

Ireland Northern 1.10 66.70 25.60 40.00 20.50 79.87 50.00

Norway Northern 1.60 51.60 39.80 60.00 7.70 135.77 56.00

UK Northern 1.83 45.60 28.80 56.00 20.30 124.19 59.00

Finland Northern 3.30 70.80 25.50 51.00 17.40 309.09 67.00

Sweden Northern 4.25 71.50 21.30 73.00 13.30 293.32 74.00

Greece Southern 0.64 33.00 46.60 17.00 8.00 9.87 32.00

Italy Southern 1.09 47.20 46.80 34.00 7.40 84.14 53.00

Spain Southern 0.91 47.20 39.90 34.00 11.90 30.64 55.00

Netherlands Western 1.80 51.90 35.80 67.00 6.60 246.15 32.00

Belgium Western 2.08 63.40 22.00 50.00 10.50 141.80 35.00

Germany Western 2.46 65.70 31.40 60.00 8.10 299.99 47.00

France Western 2.20 54.20 36.90 34.00 19.50 144.52 50.00

The country variable is useful to identify the statistical units, but it is not itself an object of analysis.

At the moment we consider only numerical variables

For each observation we have information collected on p variables

For each variable we have information collected on n observations

The data matrix contains information available for the n cases (rows) on the p variables (columns)

Here we have 15 rows (cases, n) and 7 columns (vars, p)

Page 19

Data matrices

Example 1 (continued). Innovation and Research in Europe (subset).

country GERD GERD_industry GERD_govern Internet_Acc ST_grad EPO E_gov_avail

Romania 0.39 47.60 43.00 6.00 5.80 1.31 25.00

Czech Republic 1.20 52.50 43.60 19.00 6.00 12.04 30.00

Lithuania 0.67 37.10 56.30 12.00 14.60 2.78 40.00

Ireland 1.10 66.70 25.60 40.00 20.50 79.87 50.00

Norway 1.60 51.60 39.80 60.00 7.70 135.77 56.00

UK 1.83 45.60 28.80 56.00 20.30 124.19 59.00

Finland 3.30 70.80 25.50 51.00 17.40 309.09 67.00

Sweden 4.25 71.50 21.30 73.00 13.30 293.32 74.00

Greece 0.64 33.00 46.60 17.00 8.00 9.87 32.00

Italy 1.09 47.20 46.80 34.00 7.40 84.14 53.00

Spain 0.91 47.20 39.90 34.00 11.90 30.64 55.00

Netherlands 1.80 51.90 35.80 67.00 6.60 246.15 32.00

Belgium 2.08 63.40 22.00 50.00 10.50 141.80 35.00

Germany 2.46 65.70 31.40 60.00 8.10 299.99 47.00

France 2.20 54.20 36.90 34.00 19.50 144.52 50.00

To each observation a collection of p values is associated: the values observed on each variable for that observation.

Similarly, to each variable a collection of n values can be associated (the values observed for all the cases).

A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension (k × 1), i.e., a collection of values arranged in k rows and 1 column.

A row (1 × k) vector can always be seen as the transpose of a column (k × 1) vector.

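Page 20

Data matrices

Data matrix (n individuals and p variables):

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

x_i = vector (p × 1) containing the measurements on the p vars for the i-th case. Its transpose x_i^T is the corresponding row vector (transposition operation):

$$\mathbf{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}, \qquad \mathbf{x}_i^T = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{bmatrix}$$

x_(j) = vector (n × 1) containing the n measurements on the j-th variable.

A data matrix can be seen as a collection of n row (transposed) vectors (cases) and/or as a collection of p column vectors (variables):

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{(1)} & \mathbf{x}_{(2)} & \cdots & \mathbf{x}_{(p)} \end{bmatrix}$$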

Page 21

Data matrices

Example 1 (continued). Innovation and Research in Europe (subset).

country GERD GERD_industry GERD_govern Internet_Acc ST_grad EPO E_gov_avail

Romania 0.39 47.60 43.00 6.00 5.80 1.31 25.00

Czech Republic 1.20 52.50 43.60 19.00 6.00 12.04 30.00

Lithuania 0.67 37.10 56.30 12.00 14.60 2.78 40.00

Ireland 1.10 66.70 25.60 40.00 20.50 79.87 50.00

Norway 1.60 51.60 39.80 60.00 7.70 135.77 56.00

UK 1.83 45.60 28.80 56.00 20.30 124.19 59.00

Finland 3.30 70.80 25.50 51.00 17.40 309.09 67.00

Sweden 4.25 71.50 21.30 73.00 13.30 293.32 74.00

Greece 0.64 33.00 46.60 17.00 8.00 9.87 32.00

Italy 1.09 47.20 46.80 34.00 7.40 84.14 53.00

Spain 0.91 47.20 39.90 34.00 11.90 30.64 55.00

Netherlands 1.80 51.90 35.80 67.00 6.60 246.15 32.00

Belgium 2.08 63.40 22.00 50.00 10.50 141.80 35.00

Germany 2.46 65.70 31.40 60.00 8.10 299.99 47.00

France 2.20 54.20 36.90 34.00 19.50 144.52 50.00

$\mathbf{x}_{13}^T$: row vector associated to “Belgium” (measurements on 7 vars)

$\mathbf{x}_{(6)}$: column vector associated to EPO (measurements on 15 obs)

The element in the i-th row and in the j-th column, $x_{ij}$, is the value observed for the i-th case on the j-th variable.

In this simple example, $x_{13,6} = 141.80$ is the value of EPO (6th variable) for Belgium (13th observation).
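The row/column reading of the data matrix can be illustrated by storing the subset as a list of rows. This is only a sketch: the nested-list layout and the names are illustrative, and Python indices start at 0, so the 13th case is `X[12]` and the 6th variable is column index 5.

```python
variables = ["GERD", "GERD_industry", "GERD_govern", "Internet_Acc",
             "ST_grad", "EPO", "E_gov_avail"]
countries = ["Romania", "Czech Republic", "Lithuania", "Ireland", "Norway",
             "UK", "Finland", "Sweden", "Greece", "Italy", "Spain",
             "Netherlands", "Belgium", "Germany", "France"]

# Data matrix X: n = 15 rows (cases), p = 7 columns (variables)
X = [
    [0.39, 47.60, 43.00, 6.00, 5.80, 1.31, 25.00],
    [1.20, 52.50, 43.60, 19.00, 6.00, 12.04, 30.00],
    [0.67, 37.10, 56.30, 12.00, 14.60, 2.78, 40.00],
    [1.10, 66.70, 25.60, 40.00, 20.50, 79.87, 50.00],
    [1.60, 51.60, 39.80, 60.00, 7.70, 135.77, 56.00],
    [1.83, 45.60, 28.80, 56.00, 20.30, 124.19, 59.00],
    [3.30, 70.80, 25.50, 51.00, 17.40, 309.09, 67.00],
    [4.25, 71.50, 21.30, 73.00, 13.30, 293.32, 74.00],
    [0.64, 33.00, 46.60, 17.00, 8.00, 9.87, 32.00],
    [1.09, 47.20, 46.80, 34.00, 7.40, 84.14, 53.00],
    [0.91, 47.20, 39.90, 34.00, 11.90, 30.64, 55.00],
    [1.80, 51.90, 35.80, 67.00, 6.60, 246.15, 32.00],
    [2.08, 63.40, 22.00, 50.00, 10.50, 141.80, 35.00],
    [2.46, 65.70, 31.40, 60.00, 8.10, 299.99, 47.00],
    [2.20, 54.20, 36.90, 34.00, 19.50, 144.52, 50.00],
]

n, p = len(X), len(X[0])

row_13 = X[12]                       # x_13^T: the 7 measurements for Belgium
col_6 = [row[5] for row in X]        # x_(6): the 15 measurements of EPO
x_13_6 = X[12][5]                    # single element: EPO for Belgium

print(countries[12], row_13)
print(variables[5], col_6)
print(n, p, x_13_6)
```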

Page 22

Data matrices – Vectors

A (K × 1) vector is an oriented line in a K-dimensional space.

A one-dimensional vector (scalar): $\mathbf{v} = [v_1]$

A two-dimensional vector: $\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$

A three-dimensional vector: $\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$

[Figure: each vector drawn as an oriented line from the origin, with its coordinates v1, v2, v3 on the axes.]

Vectors of higher dimension cannot be represented in this way.

Page 23

Data matrices – Vectors (length)

For a given vector in the K-dimensional space, we define its length as:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_K^2}$$

It is the length of the line connecting v to the origin, 0.

[Figure: one-, two- and three-dimensional vectors drawn from the origin 0, with their coordinates on the axes.]

Page 24

Data matrices – Vectors (Distance)

Given two vectors, v and u, in the k-dimensional space, we define the Euclidean Distance between v and u as the length of the line connecting v to u:

$$D_E(\mathbf{v}, \mathbf{u}) = \|\mathbf{v} - \mathbf{u}\| = \sqrt{(v_1 - u_1)^2 + (v_2 - u_2)^2 + \dots + (v_k - u_k)^2}$$

[Figure: example in the two-dimensional space. The line connecting v and u is the hypotenuse of a right triangle with sides |v1 – u1| and |v2 – u2|.]

Note: the length of a vector v coincides with its distance from the origin, 0:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2} = D_E(\mathbf{v}, \mathbf{0})$$
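The length and distance definitions translate directly into code. The sketch below uses small made-up vectors purely for illustration; the function names are not from the slides.

```python
import math

def length(v):
    """Length of a vector: square root of the sum of its squared elements."""
    return math.sqrt(sum(c * c for c in v))

def euclidean_distance(v, u):
    """Euclidean distance: length of the difference vector v - u."""
    return length([a - b for a, b in zip(v, u)])

v = [3.0, 4.0]          # illustrative two-dimensional vector
origin = [0.0, 0.0]

len_v = length(v)
dist_to_origin = euclidean_distance(v, origin)   # coincides with the length
dist_3d = euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])

print(len_v, dist_to_origin, dist_3d)
```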

Page 25

Analyze multivariate samples in a geometrical perspective

Describe distances in the Euclidean Space

Page 26

Data matrices

A data matrix can be seen as a collection of two kinds of vectors:

Row vectors: x_i lie in the p-dimensional space

Column vectors: x_(j) lie in the n-dimensional space

Hence two different spaces (one p-dimensional, one n-dimensional) can be considered to analyze/describe a data matrix.

Of course, these spaces are related to each other.

For the sake of simplicity, we will analyze in depth only the space of the observations.

Page 27

Syntheses of variables

How to arrange syntheses of the p variables, i.e., how to synthesize the elements of the column vectors of $X = [\mathbf{x}_{(1)}\ \mathbf{x}_{(2)}\ \cdots\ \mathbf{x}_{(p)}]$?

The position. The sample mean (unbiased estimator for the population mean) for the j-th variable (column) is:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

Remember: the mean is not robust (it is sensitive to extreme values).

Vector of the sample means (centroid):

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

It may be seen as the vector associated to the “artificial case” “mean”: an unobserved case that is average with respect to all the vars.

Page 28

The space of the observations

Consider a graphical representation we are used to: the 2-dimensional space.

[Scatter plot of E_gov_indiv and Internet_Acc, with the mean of E_gov_indiv and the mean of Internet_Acc marked. Note: axes adjusted to have the same scale.]

The centroid (vector whose elements are the sample means) is the centre of gravity of the cloud.

It is the point with the smallest sum of squared distances from all the points.

Page 29

Synthesis of variables

The dispersion around the mean.

• The sample variance (unbiased estimator for the population variance) for the j-th variable (column) is:

$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$

Notice that it is the average of the squared distances between the observed values and the sample mean, i.e., the average of the squared errors we incur when substituting the observed values with the sample mean.

• The sample standard deviation for the j-th variable (column) is $\sqrt{s_{jj}}$.

The Std. Dev has the same unit of measurement as the variable taken into account. It measures the expected error (below or above the mean) we incur when substituting the mean for a generic case.

Moreover, it can be considered as the average distance between a generic value and the mean. It is the expected distance from the mean.

Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).

Page 30

The space of the observations

Consider again the 2-dimensional space.

Let us consider the distance from Iceland (IS) to the centroid:

$$D_E(\mathbf{x}_{IS}, \bar{\mathbf{x}}) = \sqrt{(x_{IS,\,egov\_ind} - \bar{x}_{egov\_ind})^2 + (x_{IS,\,int\_acc} - \bar{x}_{int\_acc})^2}$$

[Scatter plot showing the absolute difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv, and the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc. Note: axes adjusted to have the same scale.]

Page 31

The space of the observations

Consider, in the 2-dimensional space, ALL THE DISTANCES FROM POINTS TO THE CENTROID.

[Scatter plot with the lines connecting each point to the centroid. Note: axes adjusted to have the same scale.]

Recall the sample variance of the j-th variable:

$$s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$

Var(E_gov_indiv) + Var(Internet_Acc) = SUM of the variances of THE TWO VARIABLES is proportional to the sum of the squared distances from the obs to the centroid:

$$s_{11} + s_{22} = \frac{1}{n-1}\sum_{i=1}^{n} D_E^2(\mathbf{x}_i, \bar{\mathbf{x}})$$
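The proportionality between the sum of the variances and the squared distances to the centroid can be verified numerically. The slides use E_gov_indiv on a larger set of countries that is not listed here, so this sketch substitutes the Internet_Acc and E_gov_avail columns of the 15-country subset; that substitution is an assumption made only for illustration.

```python
# Internet_Acc and E_gov_avail for the 15 countries of the subset
int_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]

n = len(int_acc)
centroid = (sum(int_acc) / n, sum(e_gov) / n)   # vector of the sample means

def sample_variance(values):
    """Sample variance with divisor n - 1."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

sum_of_variances = sample_variance(int_acc) + sample_variance(e_gov)

# Squared Euclidean distance of each observation from the centroid
squared_distances = [(x - centroid[0]) ** 2 + (y - centroid[1]) ** 2
                     for x, y in zip(int_acc, e_gov)]
scaled_sum = sum(squared_distances) / (n - 1)

print(sum_of_variances, scaled_sum)   # the two numbers should coincide
```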

Page 32

Synthesis of association between vars

The linear association.

The sample covariance for the j-th and the h-th variables (columns) is (absolute measure of linear association)

$$s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$$

The sample correlation coefficient for the j-th and the h-th variables is (relative measure of linear association; it ranges from –1 to +1)

$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}}\,\sqrt{s_{hh}}}$$

Remember: being based upon averages, the correlation coefficient is not robust (it is sensitive to extreme values).

Page 33

The space of the observations

Consider again the 2-dimensional space.

Since the covariance and the correlations are actually measuring the concentration of points around a line, both the indices give us information about the ORIENTATION of the scatter.

Note: axes adjusted to have the same scale.

Page 34

Variance and Covariance Matrix

Variances and covariances are arranged in the so-called variance and covariance matrix:

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$

S is a square matrix (the number of rows equals the number of columns).

The diagonal elements of S, s_jj, are the variances (notice that the variance can be regarded as the covariance between one variable and itself).

The off-diagonal elements of S, s_jh, are the covariances.

Since s_jh = s_hj, S is a symmetric matrix.

Page 35

Correlation Matrix

Correlations are arranged in the correlation matrix:

$$R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$

R is also a square matrix, and its diagonal elements are 1’s (the correlation between one variable and itself is 1).

Its off-diagonal elements, r_jh, are the correlations, and of course R is a symmetric matrix.

Due to the relationship between covariances and correlations:

$$r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}}\,\sqrt{s_{hh}}}$$

R can be simply obtained from the variance and covariance matrix.
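As an illustration of how S and R are built, and of how R follows from S, the sketch below uses three columns of the example (Internet_Acc, EPO, E_gov_avail). The choice of columns and the helper names are illustrative assumptions, not part of the slides.

```python
import math

# Columns: Internet_Acc, EPO, E_gov_avail (15 countries of the example)
X = [
    [6, 1.31, 25], [19, 12.04, 30], [12, 2.78, 40], [40, 79.87, 50],
    [60, 135.77, 56], [56, 124.19, 59], [51, 309.09, 67], [73, 293.32, 74],
    [17, 9.87, 32], [34, 84.14, 53], [34, 30.64, 55], [67, 246.15, 32],
    [50, 141.80, 35], [60, 299.99, 47], [34, 144.52, 50],
]

def covariance_matrix(rows):
    """Variance and covariance matrix S (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[j] - means[j]) * (r[h] - means[h]) for r in rows) / (n - 1)
             for h in range(p)]
            for j in range(p)]

def correlation_from_covariance(S):
    """Correlation matrix R obtained from S: r_jh = s_jh / sqrt(s_jj * s_hh)."""
    p = len(S)
    return [[S[j][h] / math.sqrt(S[j][j] * S[h][h]) for h in range(p)]
            for j in range(p)]

S = covariance_matrix(X)
R = correlation_from_covariance(S)

for row in R:
    print([round(r, 4) for r in row])
```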

Page 36

The space of the observations

The centroid (vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud

The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimension example) and on the orientation of the cloud

Page 37

Measuring dispersion

How to synthesize the dispersion of the n cases in the p-dimensional space? Two proposals.

TOTAL VARIANCE

• As we saw before, the sum of all the variances is proportional to the sum of the squared distances from the points to the centroid. Thus, a first method to evaluate the dispersion of the points in the p-dimensional space is the so-called Total Variance:

$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj}$$

The Total Variance is the sum of the diagonal elements of the var/cov matrix, S. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have:

$$\text{Total Variance} = \sum_{j=1}^{p} s_{jj} = \text{Trace}(S) = \text{tr}(S)$$

Notice that we are not taking into account the interrelationships between vars, i.e., the orientation of the cloud.

Page 38

The space of the observations

To motivate the second measure of multivariate dispersion, consider the “portion” of the space which is occupied by the data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in a higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, S.

Page 39

Measuring dispersion

THE GENERALIZED VARIANCE

• The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S, the so-called determinant of S, |S|.

The determinant is a number which can be calculated for a square matrix. It equals zero if two columns of the matrix are proportional, i.e., if they carry the same information.

This measure is called Generalized Variance:

Generalized Variance = det(S) = |S|

Hence, to synthesize the dispersion of points in a p-dimensional space, two measures can be used, both related to the elements of the variance and covariance matrix, S.

The Total Variance takes into account only the diagonal elements of S, whilst the Generalized Variance is calculated by referring to all the elements of S.
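The difference between the two dispersion measures can be seen on small hypothetical var/cov matrices. The three matrices below are made-up numbers chosen for illustration (they do not come from the Eurostat data): they share the same variances, hence the same Total Variance, but have different covariances.

```python
def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[j][j] for j in range(len(m)))

def det(m):
    """Determinant by cofactor expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

# Hypothetical var/cov matrices with variances 4 and 9 (std. devs 2 and 3)
S_uncorrelated = [[4.0, 0.0], [0.0, 9.0]]   # correlation 0
S_correlated = [[4.0, 5.4], [5.4, 9.0]]     # correlation 0.9
S_proportional = [[4.0, 6.0], [6.0, 9.0]]   # correlation 1: proportional columns

for name, S in [("uncorrelated", S_uncorrelated),
                ("correlated", S_correlated),
                ("proportional", S_proportional)]:
    print(name, "total variance:", trace(S), "generalized variance:", det(S))
```

The Total Variance is the same in all three cases, while the Generalized Variance shrinks as the covariance grows and vanishes when the two columns are proportional.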

Page 40

The space of the observations

The variance and covariance matrix contains relevant information to describe the points in a p-dimensional space and, also, information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original vars.

Notice first that if the variables are centred on their mean nothing changes as concerns the dispersion of the points.

This operation only consists in a change of the origin

Page 41

Multivariate Samples - Transformations

TRANSFORMATION: VARS CENTRED ON THEIR MEANS

Original Data Matrix:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Centroid = $\bar{\mathbf{x}}$; Var/Cov Matrix: S; Corr Matrix: R

Centred Data Matrix:

$$\tilde{X} = \begin{bmatrix} (x_{11} - \bar{x}_1) & \cdots & (x_{1p} - \bar{x}_p) \\ \vdots & & \vdots \\ (x_{n1} - \bar{x}_1) & \cdots & (x_{np} - \bar{x}_p) \end{bmatrix}$$

Centroid = Origin = 0; Var/Cov Matrix: S; Corr Matrix: R

The centred matrix is obtained by subtracting from each observation on a given variable the mean of that variable. That is, the mean of the j-th variable is subtracted from all the observations in the j-th column.

Page 42

A closer look at the distance

The Euclidean distance is the length of the line connecting a point to the origin. Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same.

This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner?

Notice that the distance of Slovakia from the origin is higher. We will consider this later.

Page 43

A closer look at the distance

Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15 and Std.Dev.(Int_Acc) = 21.31. To compare the deviations from the origin adequately (data are centred), we should take into account the Std.Dev (of course, squared deviations should be compared with variances).

Internet_Acc has a higher std.dev. Hence, a deviation D from the origin along the horizontal axis should “count less” than a deviation D from the origin along the vertical axis.

Page 44: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

44

A closer look at the distanceIn the Euclidean distance, the deviations are considered in absolute terms. When we are considering variables having different Std.Dev, we should consider relative deviations. To remove the effect of Std. Dev, thus obtaining comparable deviations, we have to standardize the variables.

Standardization of the j-th variable:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}$$

The Euclidean Distance between two standardized observations is:

$$D_E^2(z_i, z_h) = (z_{i1} - z_{h1})^2 + (z_{i2} - z_{h2})^2 + \dots + (z_{ip} - z_{hp})^2$$

$$= \frac{(x_{i1} - x_{h1})^2}{s_{11}} + \frac{(x_{i2} - x_{h2})^2}{s_{22}} + \dots + \frac{(x_{ip} - x_{hp})^2}{s_{pp}} = D_S^2(x_i, x_h)$$

Statistical Distance: A different weight (1/s_{jj}) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
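The equality between the statistical distance on the original data and the Euclidean distance on the standardized data can be checked numerically. This is a sketch under two assumptions: the data matrix is hypothetical, and the variances s_jj use the divisor n - 1 (the equality holds for the divisor n as well, as long as the same variances are used in both computations).

```python
import math

def column_stats(rows):
    """Column means and sample variances (divisor n - 1, an assumption)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    variances = [sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1)
                 for j in range(p)]
    return means, variances

def statistical_distance_sq(x_i, x_h, variances):
    """Squared statistical distance: squared deviations weighted by 1 / s_jj."""
    return sum((a - b) ** 2 / s for a, b, s in zip(x_i, x_h, variances))

def euclidean_distance_sq(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical 4 x 2 data matrix (illustrative values only)
X = [[10.0, 40.0], [20.0, 55.0], [30.0, 35.0], [40.0, 70.0]]
means, variances = column_stats(X)

# Standardize: z_ij = (x_ij - mean_j) / sqrt(s_jj)
Z = [[(row[j] - means[j]) / math.sqrt(variances[j]) for j in range(2)]
     for row in X]

d_s = statistical_distance_sq(X[0], X[3], variances)  # on original data
d_e = euclidean_distance_sq(Z[0], Z[3])               # on standardized data
print(d_s, d_e)
```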

Page 45: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

45

A closer look at the distance

The statistical distance (visualization in the original/centred space). x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion. Hence Cyprus, which shows a higher y-deviation from the origin than Italy, has a statistical distance from the origin that is higher than Italy's.

Points having the same statistical distance from the origin

Notice that Slovakia now has a statistical distance from 0 that is similar to that of Cyprus.

Page 46: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

46

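Multivariate Samples - Transformations

TRANSFORMATION: STANDARDIZED VARS

Original Data Matrix:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

Centroid = $\bar{x}$

Var/Cov Matrix: S

Corr Matrix: R

Standardized Data Matrix (Z):

Centroid = Origin = 0

Var/Cov Matrix: R

Corr Matrix: R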

The standardized matrix is obtained by subtracting, from each observation on a given variable, the mean of that variable and dividing the difference by the variable's Std.Dev. Like the centred variables, the standardized variables have null mean; in addition, their variances are all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev. = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Dev.'s).

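$$Z = \begin{pmatrix} \dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\ \vdots & & \vdots \\ \dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}} \end{pmatrix}$$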
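The properties of the standardized matrix (null means, unit variances, covariances equal to correlations) can be verified with a short sketch. The data matrix and the helper names are hypothetical, and the divisor n - 1 is an assumption about how the variances and covariances are computed.

```python
import math

def column(rows, j):
    """Extract the j-th column of a data matrix given as a list of rows."""
    return [row[j] for row in rows]

def covariance(u, v):
    """Sample covariance with divisor n - 1 (an assumption)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def correlation(u, v):
    """Corr = Cov / product of the Std.Dev.'s."""
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))

def standardize(rows):
    """Subtract each column mean and divide by the column Std.Dev."""
    n, p = len(rows), len(rows[0])
    means = [sum(column(rows, j)) / n for j in range(p)]
    sds = [math.sqrt(covariance(column(rows, j), column(rows, j)))
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in rows]

# Hypothetical 4 x 2 data matrix (illustrative values only)
X = [[10.0, 40.0], [20.0, 55.0], [30.0, 35.0], [40.0, 70.0]]
Z = standardize(X)
z1, z2 = column(Z, 0), column(Z, 1)

cov_z = covariance(z1, z2)                          # covariance of standardized vars
corr_x = correlation(column(X, 0), column(X, 1))    # correlation of original vars
print(covariance(z1, z1), covariance(z2, z2), cov_z, corr_x)
```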

Page 47: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

48

A closer look at the distance

In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables. But no attention is paid to the “coherence” between each point and the cloud of points (standardization does not involve correlations). Slovakia and Cyprus are equally statistically distant from the origin.

Notice that Lithuania is more statistically distant from the origin.

Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin runs against the orientation of the cloud (it is in countertendency).

Page 48: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

49

A closer look at the distance

In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data that removes the effect of the Std.Dev. and also penalizes deviations by considering the orientation of the cloud of points is the so-called Mahalanobis transformation. We do not go into details here.

The so-called Mahalanobis distance is defined as the Euclidean distance calculated on Mahalanobis-transformed observations.

Mahalanobis transf. of the j-th variable:

$$z^M_{ij} = \mathrm{Mahal}(x_{ij}) = M(x_{ij})$$

$$D_E^2(z^M_i, z^M_h) = (z^M_{i1} - z^M_{h1})^2 + (z^M_{i2} - z^M_{h2})^2 + \dots + (z^M_{ip} - z^M_{hp})^2$$

$$= [M(x_{i1}) - M(x_{h1})]^2 + \dots + [M(x_{ip}) - M(x_{hp})]^2 = D_M^2(x_i, x_h)$$

The Mahalanobis transformation is a particular linear combination of the considered variables.
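The slides defer the definition of the transformation, so the sketch below relies on the standard textbook form of the squared Mahalanobis distance, d' S^{-1} d, written out for two variables. The covariance values and the two points are hypothetical; they only illustrate that two points with the same statistical distance can have very different Mahalanobis distances depending on whether they follow the orientation of the cloud.

```python
def statistical_sq_2d(d, s11, s22):
    """Squared statistical distance of deviation d = (d1, d2) from the origin."""
    return d[0] ** 2 / s11 + d[1] ** 2 / s22

def mahalanobis_sq_2d(d, s11, s22, s12):
    """Squared Mahalanobis distance d' S^{-1} d for a 2 x 2 covariance matrix S."""
    det = s11 * s22 - s12 * s12
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d1, d2 = d
    # Quadratic form with the explicit inverse of a 2 x 2 matrix
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

# Hypothetical variances and covariance (correlation = 1.2 / (2 * 1) = 0.6)
s11, s22, s12 = 4.0, 1.0, 1.2

along = (2.0, 1.0)     # deviation coherent with the positive orientation
against = (2.0, -1.0)  # deviation in countertendency

print(statistical_sq_2d(along, s11, s22), statistical_sq_2d(against, s11, s22))
print(mahalanobis_sq_2d(along, s11, s22, s12),
      mahalanobis_sq_2d(against, s11, s22, s12))
```

With a null covariance the quadratic form reduces to the statistical distance, which is why the two distances differ only when the variables are correlated.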

Page 49: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

50

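Multivariate Samples - Transformations

TRANSFORMATION: MAHALANOBIS

Original Data Matrix:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

Centroid = $\bar{x}$

Var/Cov Matrix: S

Corr Matrix: R

Mahalanobis Data Matrix ($Z^M$):

Centroid = Origin = 0

Var/Cov Matrix: I

Corr Matrix: I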

The Mahalanobis distance is the Euclidean distance evaluated by previously transforming data according to the Mahalanobis transformation.

The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (unit of measurement is removed), and null correlations (orientation of the cloud is removed).

$$Z^M = \begin{pmatrix} M(x_{11}) & \cdots & M(x_{1p}) \\ \vdots & & \vdots \\ M(x_{n1}) & \cdots & M(x_{np}) \end{pmatrix}$$
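The slides do not define the transformation at this point. As a hedged illustration only, the sketch below uses one common choice, a Cholesky-based whitening of centred two-variable data (S = L L', z = L^{-1}(x - mean)); the course may use a different but equivalent form, for example one based on S^{-1/2}. The data matrix, the helper names and the divisor n - 1 are all assumptions.

```python
import math

def mean_and_cov_2d(rows):
    """Means and (s11, s22, s12) of a two-column data matrix (divisor n - 1)."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    return (m1, m2), (s11, s22, s12)

def whiten_2d(rows):
    """Cholesky whitening: solve L z = (x - mean) for each row, with S = L L'."""
    (m1, m2), (s11, s22, s12) = mean_and_cov_2d(rows)
    # Lower-triangular Cholesky factor of the 2 x 2 covariance matrix
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    out = []
    for x1, x2 in rows:
        d1, d2 = x1 - m1, x2 - m2
        z1 = d1 / l11                  # forward substitution, first row
        z2 = (d2 - l21 * z1) / l22     # forward substitution, second row
        out.append([z1, z2])
    return out

# Hypothetical 4 x 2 data matrix (illustrative values only)
X = [[10.0, 40.0], [20.0, 55.0], [30.0, 35.0], [40.0, 70.0]]
Z_M = whiten_2d(X)

# The transformed variables should have null means, unit variances, null covariance
(zm1, zm2), (v11, v22, v12) = mean_and_cov_2d(Z_M)
print(zm1, zm2, v11, v22, v12)
```

The squared Euclidean distance between two rows of `Z_M` then equals the quadratic form (x_i - x_h)' S^{-1} (x_i - x_h) computed on the original rows.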

Page 50: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

51

A closer look at the distance

Mahalanobis Distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation). Now Cyprus, being in countertendency with respect to the orientation of the cloud, has a Mahalanobis distance from 0 that is higher than Slovakia's.

Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia.

Points having the same Mahalanobis distance from the origin

Page 51: 8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

53

Multivariate Samples - Transformations

                                  ORIGINAL     CENTRED ON MEAN   STANDARDIZATION   MAHALANOBIS
Data matrix                       X            \tilde{X}         Z                 Z^M
Means                             \bar{x}_j    0                 0                 0
Variances                         s_jj         s_jj              1                 1
Covariances                       s_jk         s_jk              r_jk              0
Correlations                      r_jk         r_jk              r_jk              0
Euclidean distance corresponds to Euclidean    Euclidean         Statistical       Mahalanobis

Conclusion: By transforming the data via standardization or the Mahalanobis transformation, we are simply defining a new space in which the Euclidean distance calculated on the transformed points coincides, respectively, with:

Statistical distance (standardization): deviations are weighted differently depending on the Std.Dev. of each variable.

Mahalanobis distance (Mahalanobis transformation): deviations are weighted differently depending on the Std.Dev.'s and on the orientation of the cloud (correlations/covariances).

So far the latter transformation has not been explicitly defined because of its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.
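For reference, the three distances can be written compactly as quadratic forms. This is the standard textbook notation rather than something stated in the slides so far; the symbol $D_s$ for the diagonal matrix of variances is introduced here only for illustration.

```latex
% x_i, x_h: two observations (p x 1 vectors); S: variance/covariance matrix
% D_s = diag(s_{11}, ..., s_{pp}): diagonal matrix of the variances (notation assumed here)
\begin{align*}
D_E^2(x_i, x_h) &= (x_i - x_h)' \, (x_i - x_h) \\
D_S^2(x_i, x_h) &= (x_i - x_h)' \, D_s^{-1} \, (x_i - x_h)
                 = \sum_{j=1}^{p} \frac{(x_{ij} - x_{hj})^2}{s_{jj}} \\
D_M^2(x_i, x_h) &= (x_i - x_h)' \, S^{-1} \, (x_i - x_h)
\end{align*}
```

When S is diagonal (all covariances null), the Mahalanobis distance reduces to the statistical distance; when, in addition, all variances equal 1, both reduce to the Euclidean distance.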

