Correlation and Regression With R



BPS651 Research Methodology - Laboratory Exercise

BPS651 Department of Mathematics, Statistics & Computer Science

Laboratory Exercise III

Course BPS651 Research Methodology

R.S. Rajput, Assistant Professor, Computer Science

Correlations

Correlation is used to test for a relationship between two numerical variables or two ranked (ordinal) variables.

Correlation is a bivariate analysis that measures the strength of association between two variables. The value of the correlation coefficient varies between +1 and -1. A value of ±1 indicates a perfect degree of association between the two variables. As the correlation coefficient approaches 0, the relationship between the two variables becomes weaker. Usually, in statistics, we measure three types of correlations:

Pearson correlation
Kendall rank correlation
Spearman correlation

Pearson r correlation: Pearson r correlation is widely used in statistics to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two commodities are related to each other, Pearson r correlation is used to measure the degree of relationship between the two commodities. The following formula is used to calculate the Pearson r correlation:

$$ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} $$

Where:
r = Pearson r correlation coefficient
N = number of values in each data set
∑xy = sum of the products of paired scores
∑x = sum of x scores
∑y = sum of y scores
∑x² = sum of squared x scores
∑y² = sum of squared y scores

For the Pearson r correlation, both variables should be normally distributed. Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the variables in the analysis, and homoscedasticity assumes that the data are spread with equal variance about the regression line.





Kendall rank correlation: Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, where each sample size is n, we know that the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:

$$ \tau = \frac{N_c - N_d}{\tfrac{1}{2}\, n (n - 1)} $$

Where:
Nc = number of concordant pairs
Nd = number of discordant pairs
Concordant: ordered in the same way
Discordant: ordered differently

Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test makes no assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. The following formula is used to calculate the Spearman rank correlation:

$$ \rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)} $$

Where:
ρ = Spearman rank correlation
d_i = the difference between the ranks of corresponding values x_i and y_i
n = number of values in each data set
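The three formulas can be checked by hand on a small data set. The sketch below is only an illustration: the vectors x and y are made-up values (chosen without ties, so the simple Kendall and Spearman formulas apply exactly), and the results are compared with R's built-in cor() function, which is described in the next part.

```r
# Illustrative data (hypothetical values, no ties)
x <- c(2, 4, 5, 7, 9, 12)
y <- c(3, 1, 6, 8, 7, 11)
n <- length(x)

# Pearson r from the raw sums
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Kendall tau: count concordant and discordant pairs
idx <- combn(n, 2)                      # all n(n-1)/2 index pairs
s   <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
nc  <- sum(s > 0)                       # concordant pairs
nd  <- sum(s < 0)                       # discordant pairs
tau_manual <- (nc - nd) / (n * (n - 1) / 2)

# Spearman rho from the rank differences
d <- rank(x) - rank(y)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Compare with the built-in implementation
all.equal(r_manual,   cor(x, y, method = "pearson"))
all.equal(tau_manual, cor(x, y, method = "kendall"))
all.equal(rho_manual, cor(x, y, method = "spearman"))
```

With tied values the hand formulas for Kendall and Spearman no longer match cor() exactly, because cor() applies tie corrections.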

Correlations with R

The cor() function is used to produce correlations. A simplified format is:

cor(x, use=, method= )

where

Option   Description
x        Matrix or data frame
use      Specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion)
method   Specifies the type of correlation. Options are pearson, spearman or kendall.
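As a rough illustration of these options, the sketch below computes all three coefficients for two columns of the built-in mtcars data set. The choice of the columns mpg and wt is an assumption made only for the example.

```r
# Two numeric columns of mtcars (illustrative choice)
vars <- mtcars[, c("mpg", "wt")]

r_pearson  <- cor(vars, use = "complete.obs", method = "pearson")
r_spearman <- cor(vars, use = "complete.obs", method = "spearman")
r_kendall  <- cor(vars, use = "complete.obs", method = "kendall")

# Each result is a 2 x 2 correlation matrix;
# the off-diagonal entry is the coefficient between mpg and wt.
r_pearson["mpg", "wt"]
r_spearman["mpg", "wt"]
r_kendall["mpg", "wt"]
```

All three coefficients are negative here, since heavier cars tend to have lower mileage.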





Visualizing Correlations

Simple Scatterplot

The basic function is plot(x, y), which draws a scatter plot, where x and y are numeric vectors denoting the (x, y) points to plot.

Example

> plot(mtcars$wt, mtcars$mpg, main="Scatterplot Example", xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
> # Add fit lines
> abline(lm(mpg ~ wt, data=mtcars), col="red") # regression line (y ~ x)
> lines(lowess(mtcars$wt, mtcars$mpg), col="blue") # lowess line (x, y)





Scatterplot Matrices

Use pairs() to create scatterplot matrices.

# Basic Scatterplot Matrix
pairs(~mpg+disp+drat+wt, data=mtcars,
      main="Simple Scatterplot Matrix")

Exercise 10

Protein intake X and fat intake Y (in gm) for ten old women are given as

X 56,47,33,39,42,38,46,47,38,32

Y 56,83,49,52,65,52,56,48,59,70

Calculate the correlation coefficient (Pearson), and draw the scatter plot matrix and scatter plot.

Exercise 11

Find the correlation coefficient (Pearson) between the sales and expenses from the data given below:

Firm:                 1  2  3  4  5  6  7  8  9  10
Sales (Rs Lakhs):     50 50 55 60 65 65 65 60 60 50
Expenses (Rs Lakhs):  11 13 14 16 16 15 15 14 13 13

Draw the scatter plot matrix and scatter plot.





Regression

Simple Linear Regression

A simple linear regression model that describes the relationship between two variables x and y can be expressed by the following equation. The numbers α and β are called parameters, and ϵ is the error term.

$$ y = \alpha + \beta x + \epsilon $$

For example, the data set faithful contains sample data of two random variables named waiting and eruptions. The waiting variable denotes the waiting time until the next eruption, and eruptions denotes the duration. Its linear regression model can be expressed as:

$$ \text{eruptions} = \alpha + \beta \cdot \text{waiting} + \epsilon $$

Topics to be studied

Estimated Simple Regression Equation
Coefficient of Determination
Significance Test for Linear Regression
Confidence Interval for Linear Regression
Prediction Interval for Linear Regression
Residual Plot
Standardized Residual
Normal Probability Plot of Residuals

> data()
> faithful
> head(faithful)

Estimated Simple Regression Equation

If we choose the parameters α and β in the simple linear regression model so as to minimize the sum of squares of the error term ϵ, we will have the so-called estimated simple regression equation. It allows us to compute fitted values of y based on values of x.

Problem: Apply the simple linear regression model for the data set faithful, and estimate the next eruption duration if the waiting time since the last eruption has been 80 minutes.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.lm

Then we extract the parameters of the estimated regression equation with the coefficients function.

> coeffs = coefficients(eruption.lm)
> coeffs
(Intercept)     waiting
  -1.874016    0.075628

We now fit the eruption duration using the estimated regression equation.

> waiting = 80 # the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
     4.1762

Answer: Based on the simple linear regression model, if the waiting time since the last eruption has been 80 minutes, we expect the next one to last 4.1762 minutes.
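The same estimate can also be obtained without handling the coefficients directly. The following is a minimal sketch, assuming the predict function is applied to the fitted model together with a one-row data frame; the variable names newdata, duration and manual are chosen only for this illustration.

```r
# Fit the model as in the solution above
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# predict() evaluates the estimated regression equation at waiting = 80
newdata  <- data.frame(waiting = 80)
duration <- predict(eruption.lm, newdata)

# Manual evaluation of intercept + slope * 80 for comparison
manual <- coef(eruption.lm)[1] + coef(eruption.lm)[2] * 80

all.equal(unname(duration), unname(manual))
```

Using predict avoids the misleading "(Intercept)" label that the manual calculation carries over from the coefficient vector.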

Coefficient of Determination

The coefficient of determination of a linear regression model is the quotient of the variances of the fitted values and observed values of the dependent variable. If we denote y_i as the observed values of the dependent variable, ȳ as its mean, and ŷ_i as the fitted value, then the coefficient of determination is:

$$ R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Problem: Find the coefficient of determination for the simple linear regression model of the data set faithful.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)

Then we extract the coefficient of determination from the r.squared attribute of its summary.

> summary(eruption.lm)$r.squared
[1] 0.81146

Answer: The coefficient of determination of the simple linear regression model for the data set faithful is 0.81146.
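The definition as a quotient of variances can be verified directly. The sketch below is an illustration only; the names r2_ratio and r2_summary are introduced here and are not part of the exercise.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Variance of the fitted values divided by variance of the observed values
r2_ratio   <- var(fitted(eruption.lm)) / var(faithful$eruptions)
r2_summary <- summary(eruption.lm)$r.squared

all.equal(r2_ratio, r2_summary)

# For simple linear regression with an intercept, R^2 also equals
# the squared Pearson correlation between x and y.
all.equal(r2_summary, cor(faithful$waiting, faithful$eruptions)^2)
```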

Significance Test for Linear Regression

Assume that the error term ϵ in the linear regression model is independent of x, and is normally distributed, with zero mean and constant variance. We can decide whether there is any significant relationship between x and y by testing the null hypothesis that β = 0.

Problem: Decide whether there is a significant relationship between the variables in the linear regression model of the data set faithful at the .05 significance level.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.





>eruption.lm=lm(eruptions~waiting,data=faithful)

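Then we print out the F-statistic of the significance test with the summary function.

> summary(eruption.lm)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2992 -0.3769  0.0351  0.3491  1.1933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402    0.16014   -11.7

Confidence Interval for Linear Regression

Problem: In the data set faithful, develop a 95% confidence interval of the mean eruption duration for the waiting time of 80 minutes.

Solution: We fit the linear regression model eruption.lm with the data frame faithful attached, and create a new data frame that sets the waiting time value.

> attach(faithful) # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)
> newdata = data.frame(waiting=80)

We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "confidence", and use the default 0.95 confidence level.

> predict(eruption.lm, newdata, interval="confidence")
     fit    lwr    upr
1 4.1762 4.1048 4.2476
> detach(faithful) # clean up

Answer: The 95% confidence interval of the mean eruption duration for the waiting time of 80 minutes is between 4.1048 and 4.2476 minutes.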

Prediction Interval for Linear Regression

Assume that the error term ϵ in the simple linear regression model is independent of x, and is normally distributed, with zero mean and constant variance. For a given value of x, the interval estimate of the dependent variable y is called the prediction interval.

Problem: In the data set faithful, develop a 95% prediction interval of the eruption duration for the waiting time of 80 minutes.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm.

> attach(faithful) # attach the data frame
> eruption.lm = lm(eruptions ~ waiting)

Then we create a new data frame that sets the waiting time value.

> newdata = data.frame(waiting=80)

We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "predict", and use the default 0.95 confidence level.

> predict(eruption.lm, newdata, interval="predict")
     fit    lwr    upr
1 4.1762 3.1961 5.1564
> detach(faithful) # clean up

Answer: The 95% prediction interval of the eruption duration for the waiting time of 80 minutes is between 3.1961 and 5.1564 minutes.
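It can help to see the confidence interval and the prediction interval side by side. The sketch below is an illustration under the same model; it passes data=faithful to lm instead of attaching the data frame, and the names conf_int and pred_int are introduced only for this example.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
newdata     <- data.frame(waiting = 80)

# Interval for the mean eruption duration at waiting = 80
conf_int <- predict(eruption.lm, newdata, interval = "confidence", level = 0.95)

# Interval for a single new eruption duration at waiting = 80
pred_int <- predict(eruption.lm, newdata, interval = "prediction", level = 0.95)

# Both intervals share the same centre (fit), but the prediction
# interval is wider because it also covers the error term.
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
pred_width > conf_width
```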

Residual Plot

The residual data of the simple linear regression model is the difference between the observed data of the dependent variable y and the fitted values ŷ.

Problem: Plot the residual of the simple linear regression model of the data set faithful against the independent variable waiting.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm. Then we compute the residual with the resid function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)

We now plot the residual against the observed values of the variable waiting.

> plot(faithful$waiting, eruption.res,
+      ylab="Residuals", xlab="Waiting Time",
+      main="Old Faithful Eruptions")
> abline(0, 0) # the horizon

Standardized Residual

The standardized residual is the residual divided by its standard deviation.

Problem: Plot the standardized residual of the simple linear regression model of the data set faithful against the independent variable waiting.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm. Then we compute the standardized residual with the rstandard function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.stdres = rstandard(eruption.lm)

We now plot the standardized residual against the observed values of the variable waiting.

> plot(faithful$waiting, eruption.stdres,
+      ylab="Standardized Residuals",
+      xlab="Waiting Time",
+      main="Old Faithful Eruptions")
> abline(0, 0) # the horizon
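To make "the residual divided by its standard deviation" concrete, the sketch below recomputes the standardized residuals by hand. It assumes the usual estimate of the standard deviation of the i-th residual, s * sqrt(1 - h_i), where s is the residual standard error and h_i is the leverage of observation i; the name stdres_manual is introduced only for this illustration.

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

e <- resid(eruption.lm)            # raw residuals
s <- summary(eruption.lm)$sigma    # residual standard error
h <- hatvalues(eruption.lm)        # leverage of each observation

# Residual divided by its estimated standard deviation
stdres_manual <- e / (s * sqrt(1 - h))

all.equal(stdres_manual, rstandard(eruption.lm))
```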





Normal Probability Plot of Residuals

The normal probability plot is a graphical tool for comparing a data set with the normal distribution. We can use it with the standardized residual of the linear regression model and see if the error term ϵ is actually normally distributed.

Problem: Create the normal probability plot for the standardized residual of the data set faithful.

Solution: We apply the lm function to a formula that describes the variable eruptions by the variable waiting, and save the linear regression model in a new variable eruption.lm. Then we compute the standardized residual with the rstandard function.

> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.stdres = rstandard(eruption.lm)

We now create the normal probability plot with the qqnorm function, and add the qqline for further comparison.

> qqnorm(eruption.stdres,
+        ylab="Standardized Residuals",
+        xlab="Normal Scores",
+        main="Old Faithful Eruptions")
> qqline(eruption.stdres)

Exercise 12

Geographical area x and area under paddy cultivation y (in hectares) for 15 villages of a district are given below:

X 103,106,120,120,100,151,160,155,136,178,196,140,160,166,112

Y 041,033,087,078,035,081,090,085,070,100,102,070,082,085,050

Calculate the correlation coefficient, the regression equation of y on x, and the coefficient of determination, and display the significance test.

Estimate paddy cultivation where geographical area is 136 hectares, and develop a 95% confidence interval & 95% prediction interval of the mean y for 136 hectares.

Draw the residual plot, standardized residual plot and normal probability plot of residuals.

Exercise 13





Calculate the correlation coefficient between marks obtained in the 1st pre-final and 2nd pre-final examinations on the basis of the following data collected for a sample of 12 students.

I 12,14,9.5,10.5,8,11.5,10,14,8,9.5,11,12

II 11.5,13.5,12,14,7,14,8,12.5,6.5,10,9,12

Calculate the correlation coefficient, the regression equation of y (i.e. II) on x (i.e. I), and the coefficient of determination, and display the significance test.


Exercise 14

Twelve students obtained the following percentages of marks in Physics and Statistics. Calculate:

Correlation coefficient
Linear regression equation of y on x

X 73,42,88,38,68,75,80,54,64,48,35,37

Y 73,48,86,58,65,60,76,54,50,38,32,30

Calculate the coefficient of determination, and display the significance test.

Estimate Y where X is 80, and develop a 95% confidence interval & 95% prediction interval of the mean Y for X = 80.





Multiple Linear Regressions

A multiple linear regression (MLR) model that describes a dependent variable y by independent variables x_1, x_2, ..., x_p (p > 1) is expressed by the equation as follows, where the numbers α and β_k (k = 1, 2, ..., p) are the parameters, and ϵ is the error term.

$$ y = \alpha + \sum_{k=1}^{p} \beta_k x_k + \epsilon $$

For example, in the built-in data set stackloss from observations of a chemical plant operation, if we assign stack.loss as the dependent variable, and assign Air.Flow (cooling air flow), Water.Temp (inlet water temperature) and Acid.Conc. (acid concentration) as independent variables, the multiple linear regression model is:

$$ \text{stack.loss} = \alpha + \beta_1 \cdot \text{Air.Flow} + \beta_2 \cdot \text{Water.Temp} + \beta_3 \cdot \text{Acid.Conc.} + \epsilon $$

> data()
> stackloss

Topics to be studied

Estimated Multiple Regression Equation
Multiple Coefficient of Determination
Adjusted Coefficient of Determination
Significance Test for MLR
Confidence Interval for MLR
Prediction Interval for MLR

Estimated Multiple Regression Equation

If we choose the parameters α and β_k (k = 1, 2, ..., p) in the multiple linear regression model so as to minimize the sum of squares of the error term ϵ, we will have the so-called estimated multiple regression equation. It allows us to compute fitted values of y based on a set of values of x_k (k = 1, 2, ..., p).

Problem: Apply the multiple linear regression model for the data set stackloss, and predict the stack loss if the air flow is 72, water temperature is 20 and acid concentration is 85.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)
> stackloss.lm

We also wrap the parameters inside a new data frame named newdata.

> newdata = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)

Lastly, we apply the predict function to stackloss.lm and newdata.

> predict(stackloss.lm, newdata)
     1
24.582

Answer: Based on the multiple linear regression model and the given parameters, the predicted stack loss is 24.582.

Multiple Coefficient of Determination

The coefficient of determination of a multiple linear regression model is the quotient of the variances of the fitted values and observed values of the dependent variable. If we denote y_i as the observed values of the dependent variable, ȳ as its mean, and ŷ_i as the fitted value, then the coefficient of determination is:

$$ R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

Problem: Find the coefficient of determination for the multiple linear regression model of the data set stackloss.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)

Then we extract the coefficient of determination from the r.squared attribute of its summary.

> summary(stackloss.lm)$r.squared
[1] 0.91358

Answer: The coefficient of determination of the multiple linear regression model for the data set stackloss is 0.91358.

Adjusted Coefficient of Determination

The adjusted coefficient of determination of a multiple linear regression model is defined in terms of the coefficient of determination as follows, where n is the number of observations in the data set, and p is the number of independent variables.

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} $$

Problem: Find the adjusted coefficient of determination for the multiple linear regression model of the data set stackloss.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)

Then we extract the adjusted coefficient of determination from the adj.r.squared attribute of its summary.

> summary(stackloss.lm)$adj.r.squared
[1] 0.89833

Answer: The adjusted coefficient of determination of the multiple linear regression model for the data set stackloss is 0.89833.
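The adjustment can be reproduced by hand from the ordinary coefficient of determination. The sketch below assumes the standard definition 1 - (1 - R²)(n - 1)/(n - p - 1), with n the number of observations and p the number of independent variables; the names r2 and adj_manual are introduced only for this illustration.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

r2 <- summary(stackloss.lm)$r.squared
n  <- nrow(stackloss)   # number of observations (21)
p  <- 3                 # number of independent variables

# Adjusted R^2 computed from R^2, n and p
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(adj_manual, summary(stackloss.lm)$adj.r.squared)
```

Unlike R², the adjusted value falls when an added variable does not improve the fit enough to justify the lost degree of freedom.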

Significance Test for MLR

Assume that the error term ϵ in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. We can decide whether there is any significant relationship between the dependent variable y and any of the independent variables x_k (k = 1, 2, ..., p).

Problem: Decide which of the independent variables in the multiple linear regression model of the data set stackloss are statistically significant at the .05 significance level.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data=stackloss)

The t values of the independent variables can be found with the summary function.

> summary(stackloss.lm)

Call:
lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

Residuals:
   Min     1Q Median     3Q    Max
-7.238 -1.712 -0.455  2.361  5.698

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -39.920     11.896   -3.36   0.0038 **
Air.Flow       0.716      0.135    5.31  5.8e-05 ***
Water.Temp     1.295      0.368    3.52   0.0026 **
Acid.Conc.    -0.152      0.156   -0.97   0.3440
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.24 on 17 degrees of freedom
Multiple R-squared: 0.914, Adjusted R-squared: 0.898
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.02e-09

Answer: As the p-values of Air.Flow and Water.Temp are less than 0.05, they are both statistically significant in the multiple linear regression model of stackloss. The p-value of Acid.Conc. (0.3440) is greater than 0.05, so it is not statistically significant.
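Rather than reading the p-values off the printed summary, they can be extracted from the coefficient table. The sketch below is an illustration only; the names coefs, p_values and significant are introduced here.

```r
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)

# Coefficient table: one row per term, columns include "Pr(>|t|)"
coefs <- summary(stackloss.lm)$coefficients

# p-values of the independent variables (drop the intercept row)
p_values <- coefs[-1, "Pr(>|t|)"]

# Names of the variables significant at the .05 level
significant <- names(p_values)[p_values < 0.05]
significant
```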

Confidence Interval for MLR

Assume that the error term ϵ in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. For a given set of values of x_k (k = 1, 2, ..., p), the interval estimate for the mean of the dependent variable, ȳ, is called the confidence interval.

Problem: In the data set stackloss, develop a 95% confidence interval of the stack loss if the air flow is 72, water temperature is 20 and acid concentration is 85.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> attach(stackloss) # attach the data frame
> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.)

Then we wrap the parameters inside a new data frame variable newdata.

> newdata = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)

We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "confidence", and use the default 0.95 confidence level.

> predict(stackloss.lm, newdata, interval="confidence")
     fit    lwr    upr
1 24.582 20.218 28.945
> detach(stackloss) # clean up

Answer: The 95% confidence interval of the stack loss with the given parameters is between 20.218 and 28.945.

Prediction Interval for MLR

Assume that the error term ϵ in the multiple linear regression (MLR) model is independent of x_k (k = 1, 2, ..., p), and is normally distributed, with zero mean and constant variance. For a given set of values of x_k (k = 1, 2, ..., p), the interval estimate of the dependent variable y is called the prediction interval.

Problem: In the data set stackloss, develop a 95% prediction interval of the stack loss if the air flow is 72, water temperature is 20 and acid concentration is 85.

Solution: We apply the lm function to a formula that describes the variable stack.loss by the variables Air.Flow, Water.Temp and Acid.Conc. And we save the linear regression model in a new variable stackloss.lm.

> attach(stackloss) # attach the data frame
> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.)

Then we wrap the parameters inside a new data frame variable newdata.

> newdata = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)

We now apply the predict function and set the predictor variable in the newdata argument. We also set the interval type as "predict", and use the default 0.95 confidence level.

> predict(stackloss.lm, newdata, interval="predict")
     fit    lwr    upr
1 24.582 16.466 32.697
> detach(stackloss) # clean up

Answer: The 95% prediction interval of the stack loss with the given parameters is between 16.466 and 32.697.

Exercise 15

Following are the data on yield per plant (Y), days to maturity (x1) and length of earhead (x2) for 12 plants of a crop. Calculate:

Multiple correlation coefficients
Multiple regression equation of Y on x1 and x2

Y   10  20  30  50  70  90  100 130 140 150 155 160
x1  50  50  50  51  52  53  54  55  55  55  56  58
x2  1.0 1.2 1.5 2.0 2.5 3.0 3.3 4.0 5.0 6.0 6.5 7.5

Calculate the multiple coefficient of determination and the adjusted coefficient of determination, and display the significance test for MLR.

Estimate Y where x1=53, x2=2, and develop a 95% confidence interval & 95% prediction interval of the mean y for x1=53, x2=2.

Exercise 16

Compute the values of the correlation coefficients and the multiple regression equation, calculate the multiple coefficient of determination and the adjusted coefficient of determination, and display the significance test for MLR for the following data.

X1      X2    X3   X4      X5   X6   X7   X8   X9   X10   Y
2931.5  1055  242  867.3   282  476  985  430  276  1093  16.5
1054.9  689   290  313.9   884  172  111  413  276  1473  16.0
4671.7  1102  132  394.5   297  166  674  441  276  1218  13.0
7090.0  1785  235  1209.7  257  299  534  475  276  1613  18.0
6280.4  993   337  2454.0  281  532  735  434  276  1066  18.5
2737.6  395   211  1002.0  293  155  492  570  280  1235  18.5
3176.9  836   441  2352.9  288  270  317  540  284  1305  17.5
2719.1  663   181  3083.0  270  229  440  344  281  1401  20.5
8274.9  1370  233  1239.5  281  497  938  441  376  1029  18.5
4571.7  766   196  1053.8  250  145  451  402  278  1095  17.5
8965.9  1030  254  853.5   283  164  360  391  277  1056  21.1
6907.3  1160  219  762.7   281  360  401  343  277  1062  19.5
8852.9  778   189  1023.0  280  388  347  365  285  1621  20.0
2469.2  822   185  942.0   281  350  340  360  292  1146  21.1
9861.1  1186  144  443.0   285  250  600  400  278  1246  13.5



End of Laboratory Exercise

Laboratory -IV

Logistic Regression

Analysis of Variance
