VECTOR PROJECTIONS
[Figure: projection of vector y onto vector x; the projection meets x at a 90° angle]
Length of the projection of y onto x: L_{y,x} = (x · y) / ‖x‖
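As a sketch of the projection formula above (the vectors x and y here are made up for illustration):

```r
# Sketch (not from the slides): projecting a vector y onto a vector x in R.
x <- c(3, 0)
y <- c(2, 2)

# Length of the projection of y onto x: L_{y,x} = (x . y) / ||x||
L <- sum(x * y) / sqrt(sum(x * x))

# The projected vector points along the unit vector x / ||x||
proj <- L * x / sqrt(sum(x * x))

L     # length of the projection
proj  # the projected vector; y - proj is perpendicular to x
```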
MATRIX OPERATION: INVERSE MATRIX
Important for solving a set of linear equations is the matrix operation that defines the inverse of a matrix.
X⁻¹: the inverse matrix of X, with X⁻¹ X = I
where I is the identity matrix: all entries on the diagonal are 1, all others 0
(here for a 3 x 3 matrix)
Not all matrices have an inverse matrix, and there is no simple rule for how to calculate the entries of an inverse matrix!
We skip the formal mathematical aspects and note here only the important facts:
For symmetric square matrices like covariance matrices or correlation matrices, the inverse exists (provided no variable is an exact linear combination of the others).
X⁻¹ X = I, where I is the identity matrix.
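A minimal sketch in R, assuming the covariance matrix is non-singular; `solve()` is R's built-in matrix-inverse function:

```r
# Sketch: in R, solve(A) returns the inverse of a square matrix A.
# Covariance matrices of non-degenerate data are symmetric and invertible.
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # 100 observations, 2 variables
C <- cov(X)                         # symmetric 2 x 2 covariance matrix

Cinv <- solve(C)                    # inverse matrix
round(Cinv %*% C, 10)               # the identity matrix I
```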
SUMMARY: SIMPLE LINEAR REGRESSION / PRINCIPAL COMPONENT ANALYSIS
In a 2-dimensional sample space:
Simple Linear Regression: minimizes the summed squared errors (measured in the vertical direction between the fitted regression line and the observed data points).
Principal Component Analysis: finds the direction of the vector that maximizes the variance of the data projected onto this vector.
REGRESSION ANALYSIS IN R
Simple linear regression in R: the function res <- lm(y ~ x) calculates the linear regression line. It returns a number of useful additional statistical measures of the quality of the regression line.
Regression line: res$fitted
Residuals (errors): res$residuals
Remember: we assumed that the errors are uncorrelated with the 'predictor' variable x. It is recommended to check that the errors themselves do NOT have an organized structure when plotted over x.
Histogram of residuals (errors): hist(res$residuals)
Remember: we assumed that the errors are uncorrelated with the 'predictor' variable x. It is also recommended to check whether the errors follow a Gaussian (bell-shaped) distribution.
Note: the function fgauss() is defined in myfunctions.R [call source("scripts/myfunctions.R")]
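A sketch of this workflow with simulated data (the variable names and simulated values are illustrative, not from the course scripts):

```r
# Sketch: fit a simple linear regression and run the two residual checks
# described above, on simulated data.
set.seed(42)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

res <- lm(y ~ x)

fitted_line <- res$fitted      # the fitted regression line
errors      <- res$residuals   # the residuals (errors)

# Check 1: residuals should show no organized structure over x
plot(x, errors)

# Check 2: residuals should look roughly Gaussian (bell-shaped)
hist(res$residuals)
```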
LINEAR REGRESSION STATISTICS
When applying linear regression, a number of test statistics are calculated by R's lm() function:
Regression parameter (slope): the slope of the regression line
Statistical significance (p-value): the smaller the value, the higher the significance of the linear relationship (slope different from 0)
Correlation coefficient between the fitted y-values and the observed y-values
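These statistics can be extracted from `summary()` directly; a sketch with simulated data (the values are illustrative):

```r
# Sketch: extracting the test statistics listed above from summary(lm(...)).
set.seed(7)
x <- 1:30
y <- 1.5 * x + rnorm(30, sd = 3)
res <- lm(y ~ x)
s <- summary(res)

slope   <- s$coefficients["x", "Estimate"]   # regression parameter (slope)
p_value <- s$coefficients["x", "Pr(>|t|)"]   # statistical significance
r       <- cor(res$fitted, y)                # fitted vs. observed correlation

# r^2 equals the R-squared reported by summary()
c(slope, p_value, r^2, s$r.squared)
```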
LINEAR REGRESSION: USE LINEAR REGRESSION WITH CAUTION!
Outliers can have a large effect and suggest a linear relationship where there is none! The influence of single outlier observations can be tested.
The sample space is important! If you only observed x and y in a limited range or a subdomain of the sample space, extrapolation can give misleading results.
MULTIPLE LINEAR REGRESSION
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis (figures retrieved April 2014)
Predictand (e.g. Albany Airport temperature anomalies)
Predictors, e.g.: temperatures from nearby stations; or indices of large-scale climate modes like the El Niño Southern Oscillation and the North Atlantic Oscillation; or prescribed time-dependent functions like a linear trend, periodic oscillations, polynomials
Random error (noise)
Write a set of linear equations, one for each observation in the sample (e.g. for each year of temperature observations).
Or, in short matrix notation: y = X β + ε
y = X β + ε
The mathematical problem we need to solve is:
Given all the observations of the predictand (stored in vector y) and the predictor variables (stored in matrix X), we want to find, simultaneously, a proper scaling factor for each predictor variable such that the fitted (estimated) values minimize the sum of the squared errors.
Size of the vectors / matrices: y is n x 1, X is n x k, β is k x 1, ε is n x 1
y = X β + ε
ŷ = X β̂ (fitted values)
β̂ = (Xᵀ X)⁻¹ Xᵀ y
Size of the vectors / matrices: β̂ is k x 1; Xᵀ X is (k x n)(n x k) = k x k; Xᵀ y is (k x n)(n x 1) = k x 1
In Xᵀ X we find the covariance matrix (scaled by n) of the predictor variables. The '⁻¹' indicates another fundamentally important matrix operation: the inverse of a matrix.
Xᵀ y is the covariance (scaled by n) of all predictors with the predictand.
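The closed-form solution can be checked against lm() directly; a sketch with simulated data (the variable names and true coefficients are made up):

```r
# Sketch: the closed-form solution beta = (X^T X)^{-1} X^T y, compared
# against R's lm(). The first column of X is all ones (the intercept).
set.seed(3)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

X    <- cbind(1, x1, x2)                  # n x k design matrix
beta <- solve(t(X) %*% X) %*% t(X) %*% y  # k x 1 vector of coefficients

cbind(beta, coef(lm(y ~ x1 + x2)))        # both columns agree
```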
The resulting k x 1 matrix (i.e. a vector) contains a proper scaling factor for each predictor. In other words: multiple linear regression is a weighted sum of the predictors (after conversion into the units of the predictand y).
EXAMPLE: MULTIPLE LINEAR REGRESSION WITH 2 PREDICTORS
The scatter cloud shows a linear dependence of the values in y along the two predictor dimensions x1 and x2.
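A sketch of such a 2-predictor fit on simulated data (the true coefficients below are made up for illustration):

```r
# Sketch: fit a 2-predictor multiple linear regression and check how well
# the fitted values match the observed values.
set.seed(11)
x1 <- runif(200); x2 <- runif(200)
y  <- 3 * x1 + 1.5 * x2 + rnorm(200, sd = 0.2)

res <- lm(y ~ x1 + x2)
r   <- cor(res$fitted, y)   # close to 1: the fit captures the scatter cloud
r
```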
TIPS FOR MULTIPLE LINEAR REGRESSION (MLR)
General rule: work with as few predictors as possible. (Every time you add a new predictor, you increase the risk of over-fitting the model.)
Observe how well the fitted values and observed values match (correlation).
Choose predictors that provide independent information about the predictand.
The problem of collinearity: if the predictors are all highly correlated among each other, the MLR can become very ambiguous (because it gets harder to calculate the inverse of the covariance matrix accurately).
Last but not least: the regression coefficients from the MLR are not 'unique'. If you add or remove one predictor, all regression coefficients can change.
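The collinearity problem can be demonstrated in a few lines; a sketch with simulated data (the near-duplicate predictor x2 is constructed for illustration):

```r
# Sketch: collinearity makes coefficients unstable. x2 is nearly a copy
# of x1, so X^T X is close to singular and the coefficients become ambiguous.
set.seed(5)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)    # highly correlated second predictor
y  <- x1 + rnorm(n, sd = 0.5)

coef(lm(y ~ x1))        # stable slope near the true value 1
coef(lm(y ~ x1 + x2))   # coefficients can swing wildly, even in sign
kappa(cbind(1, x1, x2)) # a large condition number signals collinearity
```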
PRINCIPAL COMPONENT ANALYSIS: GLOBAL SEA SURFACE TEMPERATURES
From voluntary ship observations; colors show the percentage of months with at least one observation in a 2 by 2 degree grid box.
From a paper in Annual Review of Marine Science (2010)
Climatology 1982-2008
Red areas mark regions with the highest SST variability.
Principal Component Analysis (PCA)
(also called Empirical Orthogonal Functions (EOF))
The first leading eigenvector: the eigenvectors now form a geographic pattern. Grid boxes with high positive values and grid boxes with large negative values are covarying out of phase (negative correlation). Green regions show only small variations in this eigenvector #1.
The principal component is a time series showing the temporal evolution of the SST variations. This mode is associated with the El Niño - Southern Oscillation.
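PCA is available in R via prcomp(); a sketch on simulated data with a shared oscillation standing in for the ENSO-like mode (the data and loadings are made up for illustration):

```r
# Sketch: PCA with prcomp(). Rows play the role of time steps, columns the
# role of grid boxes, as in the SST example above. Grid boxes with loadings
# of opposite sign covary out of phase.
set.seed(9)
steps <- 1:200
mode  <- sin(2 * pi * steps / 40)             # a shared oscillation
data  <- outer(mode, c(1, -1, 0.5, -0.5)) +   # out-of-phase "grid boxes"
         matrix(rnorm(200 * 4, sd = 0.2), ncol = 4)

pca <- prcomp(data)
pca$rotation[, 1]   # eigenvector #1: the "geographic" pattern (loadings)
pca$x[, 1]          # principal component #1: the time series
summary(pca)$importance["Proportion of Variance", 1]  # variance explained
```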