8/12/2019 Probability and Statstical Inference 7
PROBABILITY & STATISTICAL INFERENCE
LECTURE 9
MSc in Computing (Data Analytics)
Lecture Outline
• ANOVA versus Regression
• Correlations
• Simple Linear Regression
• Multiple Regression
• Section Takeaways
Response     Factor        Type of Analysis
Continuous   Categorical   T-test/ANOVA
Continuous   Continuous    Simple Linear Regression
ANOVA vs Simple Linear Regression
Scatter Plot
• A scatter plot is a type of chart using Cartesian coordinates to display values for two continuous variables for a set of data
[Figure: scatter plot of Y against x]
Describing Linear Relationships
• Correlation: we can quantify the relationship between two variables with correlation statistics
• Two variables are correlated if there is a linear relationship between them
• We can further classify correlated variables according to the type of correlation:
  ◦ Positive: one variable tends to increase in value as the other increases in value
  ◦ Negative: one variable tends to decrease in value as the other increases in value
  ◦ Zero: no linear relationship between the two variables (uncorrelated)
Pearson Correlation Coefficient
How to Calculate Correlation?
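• The correlation coefficient between two samples x1, x2, x3, ..., xn and y1, y2, y3, ..., yn is calculated with the following formula:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]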
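As a minimal sketch of the calculation, the snippet below computes the Pearson correlation coefficient in plain Python. The function name `pearson_r` is illustrative only, and the sample values are the Runtime and Oxygen Consumption columns of the nine athletes listed in the Fast Mile Test table in this lecture, not the full set of 31.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (n >= 2)")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Nine rows from the Fast Mile Test table (subset of the 31 athletes)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

r = pearson_r(runtime, oxygen)
print(f"r = {r:.3f}")  # negative: slower runs go with lower oxygen consumption
```

A value close to -1 here indicates a strong negative linear relationship in this subset; the value for the full dataset may differ.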
Caution Using Correlation
• Four sets of data with the same correlation coefficient of 0.816
Example: The Fast Mile Test
• You have been tasked by Team Ireland to analyse data from a study conducted to investigate how fast athletes' bodies can absorb and use up oxygen
• The results of this study will be used to help trainers devise custom regimes for their athletes
• A dataset has been gathered from 31 athletes, each of whom performed a fast-mile test for which their maximum pulse rate, rest pulse rate, run pulse rate, run time and oxygen consumption were measured
Example: The Fast Mile Test

Oxygen Consumption  Gender  Age  Weight  Runtime  Rest Pulse  Run Pulse  Max Pulse
44.609              Male    44   89.47   11.37    62          178        182
45.313              Male    40   75.07   10.07    62          185        185
54.297              Female  44   85.84    8.65    45          156        168
59.571              Male    42   68.15    8.17    40          166        172
49.874              Female  38   89.02    9.22    55          178        180
44.811              Female  47   77.45   11.63    58          176        176
45.681              Male    40   75.98   11.95    70          176        180
49.091              Male    43   81.19   10.85    64          162        170
39.442              Female  44   81.42   13.08    63          174        176
Example: Runtime vs Oxygen Consumption
Demo
Regression Model
• Can we capture the relationship between two variables in the scatter plot?
[Figure: scatter plot of Y against x]
Regression Model
• Based on the scatter plot, it is probably reasonable to assume that the random variable Y is related to x by a straight-line relationship
Regression Model
[Figure: regression line of Y against x; the slope β1 is the change in Y for a one-unit change in x]
Simple Linear Regression
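• The case of simple linear regression considers a single regressor (or predictor), x, and a dependent (or response) variable, Y
• The expected value of Y at each level of x is a linear function of x:
\[ E(Y \mid x) = \beta_0 + \beta_1 x \]
• We assume that each observation, Y, can be described by the model:
\[ Y = \beta_0 + \beta_1 x + \varepsilon \]
where ε is a random error term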
Simple Linear Regression
• Suppose that we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn)
• The method of least squares is used to estimate the parameters, β0 and β1, by minimizing the sum of the squares of the vertical deviations in the diagram below
[Figure: deviations of the data (observed values, y) from the estimated regression line]
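The least squares idea can be sketched in a few lines of Python. The closed-form estimates used here, slope = Sxy / Sxx and intercept = ȳ − slope · x̄, are the standard solution to the minimization and are not shown on the slides; the function name `fit_simple_linear` is hypothetical, and the data are the nine athletes from the Fast Mile Test table rather than the full study.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = s_xy / s_xx          # slope: change in y per one-unit change in x
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


# Runtime (x) and Oxygen Consumption (y) for the nine tabulated athletes
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

b0, b1 = fit_simple_linear(runtime, oxygen)
print(f"estimated line: oxygen = {b0:.2f} + ({b1:.2f}) * runtime")

# Sum of squared vertical deviations that the estimates minimize
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(runtime, oxygen))
print(f"sum of squared deviations = {sse:.2f}")
```

Any other choice of intercept and slope would give a larger sum of squared deviations on the same data.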
Example: Oxygen Consumption vs Runtime for Team Ireland
• Can we capture the relationship between Oxygen Consumption and Runtime in the Team Ireland fitness study?
Example: Oxygen Consumption vs Runtime for Team Ireland: Regression Model
• Yes, using a simple linear regression model in which Y is the Oxygen Consumption and x is the Runtime for an athlete
Model Assumptions
• Fitting a regression model requires several assumptions:
  ◦ Errors are uncorrelated random variables with zero mean
  ◦ Errors have constant variance
  ◦ Errors are normally distributed
• The analyst should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model
Testing Assumptions: Residual Analysis
• The residuals from a regression model are:
\[ e_i = y_i - \hat{y}_i \]
where y_i is an actual observation and ŷ_i is the corresponding fitted value from the regression model
• Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful
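To make the residual definition concrete, here is a small sketch that fits a line by least squares and then lists each residual (observed minus fitted). The helper names `fit_line` and `residuals` are hypothetical, and the data are only the nine athletes shown in the Fast Mile Test table, so the numbers are illustrative rather than the lecture's results.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys, b0, b1):
    """Residual for each observation: actual value minus fitted value."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

b0, b1 = fit_line(runtime, oxygen)
e = residuals(runtime, oxygen, b0, b1)

# Listing residuals next to fitted values is the tabular form of a residual plot
for x, y, e_i in zip(runtime, oxygen, e):
    print(f"runtime={x:6.2f}  fitted={y - e_i:7.3f}  residual={e_i:+7.3f}")
```

For a least squares fit with an intercept the residuals sum to zero, so what matters in a residual plot is their pattern against x or the fitted values, not their average.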
Interpreting Residual Plots
[Figure: four residual plots of e_i against a zero reference line, labelled Satisfactory, Funnel, Double Bow and Non-linear]
Example: Oxygen Consumption vs Runtime for Team Ireland: Residual Plot
• What do we think?
Adequacy of the Regression Model
• The quantity:
\[ R^2 = \frac{SSM}{SST} \]
is called the coefficient of determination and is often used to judge the adequacy of a regression model (0 ≤ R² ≤ 1)
• We often refer (loosely) to R² as the amount of variability in the data explained or accounted for by the regression model
Example: Oxygen Consumption vs Runtime for Team Ireland: R²
• For the oxygen consumption regression model:
  R² = SSM / SST
     = 632.9 / 851.38
     = 0.7434
• Thus, the model accounts for 74.34% of the variability in the data
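The arithmetic above can be checked with a short snippet. The two sums of squares are the values quoted on the slide; the variable names are illustrative, and the error sum of squares is derived here on the assumption of the usual decomposition SST = SSM + SSE.

```python
ss_model = 632.9    # SSM: variability explained by the regression model
ss_total = 851.38   # SST: total variability in oxygen consumption

r_squared = ss_model / ss_total
ss_error = ss_total - ss_model  # SSE, assuming SST = SSM + SSE

print(f"R^2 = {r_squared:.4f}")                   # R^2 = 0.7434
print(f"unexplained share = {1 - r_squared:.4f}")
print(f"SSE = {ss_error:.2f}")
```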
Adjusted R-squared Value
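• The Adjusted R-squared Value is calculated as follows:
\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \]
where n is the number of observations and p is the number of factors in the model
• The figure is adjusted to take into consideration the number of factors in the model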
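As a hedged illustration, the sketch below applies the commonly used definition of adjusted R-squared, 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of factors. This definition is an assumption here, since the slide's own formula did not survive extraction. The inputs are the R² of 0.7434 and the 31 athletes quoted in this lecture, with one factor (Runtime).

```python
def adjusted_r_squared(r_squared, n_obs, n_factors):
    """Adjusted R^2 under the common definition (assumed, not taken from the slide)."""
    dof = n_obs - n_factors - 1
    if dof <= 0:
        raise ValueError("need more observations than factors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Oxygen consumption model: R^2 = 0.7434, 31 athletes, one factor (Runtime)
adj = adjusted_r_squared(0.7434, n_obs=31, n_factors=1)
print(f"adjusted R^2 = {adj:.4f}")

# The penalty grows with the number of factors for the same R^2
for p in (1, 2, 5):
    print(p, round(adjusted_r_squared(0.7434, 31, p), 4))
```

The adjusted value is always at most R², and it drops further as factors are added without a matching gain in R².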
Demo
Multiple Regression Models
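• Many applications of regression analysis involve situations in which there is more than one regressor variable
• A regression model that contains more than one regressor variable is called a multiple regression model
• The multiple linear regression model with k regressor variables is:
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon \]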
Example: Oxygen Consumption vs Runtime for Team Ireland: Regression Model
• For example, suppose that we want to test the effect of both age and runtime on oxygen consumption in the Team Ireland example:
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]
where:
  Y: Oxygen Consumption
  x1: Runtime
  x2: Age
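One way to estimate the coefficients of a model with Runtime and Age as regressors is to solve the least squares normal equations (X'X) b = X'y. The sketch below does this in plain Python with a small Gaussian elimination routine. Both function names are hypothetical, the approach is not necessarily what the demo software uses, and the data are only the nine athletes from the Fast Mile Test table.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (regressors may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0, *row] for row in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
age = [44, 40, 44, 42, 38, 47, 40, 43, 44]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

b0, b1, b2 = fit_multiple_regression(list(zip(runtime, age)), oxygen)
print(f"oxygen = {b0:.2f} + ({b1:.2f}) * runtime + ({b2:.2f}) * age")
```

Each slope is interpreted with the other regressor held fixed: b1 is the estimated change in oxygen consumption per extra minute of runtime for athletes of the same age.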
Example: Oxygen Consumption vs Runtime for Team Ireland: Regression Model
[Figure: 3D scatter plot of Oxygen Consumption versus Runtime and Age]
Demo
Regression & Variable Selection
• How do we select the best variables for use in a regression model?
• Perform a search to see which variables are the most effective
• Three search schemes:
  ◦ Forward sequential selection
  ◦ Backward sequential selection
  ◦ Stepwise sequential selection
Sequential Selection: Forward
[Figure: input p-values compared against the entry cutoff]
Sequential Selection: Backward
[Figure: input p-values compared against the stay cutoff]
Sequential Selection: Stepwise
[Figure: input p-values compared against the entry cutoff and the stay cutoff]
Demo
Multi-Collinearity
• Multi-collinearity exists when two or more independent variables used in a regression are correlated.
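A simple first check for multi-collinearity is to look at the pairwise correlations between candidate regressors. The sketch below does this for four columns of the nine athletes in the Fast Mile Test table. The 0.8 threshold and all names are assumptions made for illustration, not values from the lecture, and pairwise correlation will not catch collinearity that involves three or more variables jointly.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Candidate regressors: nine rows from the Fast Mile Test table
regressors = {
    "Runtime": [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08],
    "Rest Pulse": [62, 62, 45, 40, 55, 58, 70, 64, 63],
    "Run Pulse": [178, 185, 156, 166, 178, 176, 176, 162, 174],
    "Max Pulse": [182, 185, 168, 172, 180, 176, 180, 170, 176],
}

THRESHOLD = 0.8  # hypothetical cutoff for "strongly correlated"

flagged = []
for (name_a, a), (name_b, b) in combinations(regressors.items(), 2):
    r = pearson_r(a, b)
    marker = "  <- check" if abs(r) >= THRESHOLD else ""
    if abs(r) >= THRESHOLD:
        flagged.append((name_a, name_b))
    print(f"{name_a:>10} vs {name_b:<10} r = {r:+.3f}{marker}")
```

Pairs that are flagged carry largely the same information, so including both in one model tends to make the individual coefficient estimates unstable.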
Demo
Regression Bits and Pieces
• Polynomial Regression
• Logistic Regression
• Categorical Factors in Regression
Polynomial Regression
• Polynomial regression models are widely used when the response is curvilinear
• The general principles of multiple regression will apply
• The second-degree polynomial in one variable is:
\[ Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]
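Because a second-degree polynomial is linear in its coefficients, it can be fitted as a multiple regression with x and x² as the two regressors. The following sketch solves the resulting 3×3 normal equations with Cramer's rule. The function names are hypothetical and the data are synthetic, generated exactly from y = 1 + 2x + 3x², so that the recovered coefficients are easy to verify.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(xs, ys):
    """Least squares [b0, b1, b2] for y = b0 + b1*x + b2*x^2."""
    # Design columns: intercept, x, x^2 (x^2 is treated as a second regressor)
    cols = [[1.0] * len(xs), [float(x) for x in xs], [float(x) ** 2 for x in xs]]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * y for p, y in zip(ci, ys)) for ci in cols]
    d = det3(xtx)
    if abs(d) < 1e-12:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    for k in range(3):  # Cramer's rule: swap column k for the right-hand side
        mk = [row[:] for row in xtx]
        for r in range(3):
            mk[r][k] = xty[r]
        coeffs.append(det3(mk) / d)
    return coeffs


# Synthetic, noise-free data from y = 1 + 2x + 3x^2 (for illustration only)
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x * x for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Cramer's rule is used only to keep the example short; for more terms or noisier data a general linear solver is the more sensible choice.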
Logistic Regression
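• The equation for a logistic regression model is:
\[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} \]
• Choose intercept and parameter estimates to maximize:
\[ \sum_{i:\, y_i = 1} \log(p_i) + \sum_{i:\, y_i = 0} \log(1 - p_i) \]
• This function is known as the log-likelihood function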
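To illustrate what "maximize the log-likelihood" means, the sketch below evaluates the log-likelihood of a one-regressor logistic model for two candidate parameter pairs. The data are synthetic, and the function names and candidate values are assumptions chosen for illustration. A real fit would search over the parameters with a numerical optimizer rather than compare two guesses.

```python
from math import exp, log


def predicted_probability(beta0, beta1, x):
    """p = 1 / (1 + exp(-(beta0 + beta1 * x))), the inverse of the logit."""
    return 1.0 / (1.0 + exp(-(beta0 + beta1 * x)))


def log_likelihood(beta0, beta1, xs, ys):
    """Sum of log(p_i) over cases with y = 1 plus log(1 - p_i) over y = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(beta0, beta1, x)
        total += log(p) if y == 1 else log(1.0 - p)
    return total


# Synthetic example: the outcome becomes more likely as x increases
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

ll_null = log_likelihood(0.0, 0.0, xs, ys)        # every p_i equals 0.5
ll_candidate = log_likelihood(-3.5, 1.0, xs, ys)  # p_i rises with x

print(f"log-likelihood at (0, 0):      {ll_null:.4f}")
print(f"log-likelihood at (-3.5, 1.0): {ll_candidate:.4f}")
```

The candidate with the higher (less negative) log-likelihood describes the data better; the fitted model is the parameter pair where this function is largest.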
Categorical Factors in Regression
• Many problems may involve categorical variables.
• The usual method of accounting for the different levels of a qualitative variable is to use indicator variables.
• For example, to introduce the variable gender into the model, we could define an indicator variable that takes the value 0 for one gender and 1 for the other.
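A minimal sketch of indicator coding follows. The choice of 0 for Male and 1 for Female is an assumption made for illustration (the reverse coding works equally well and only changes the sign of the coefficient), and the function name is hypothetical. The Gender values are those of the nine athletes in the Fast Mile Test table.

```python
def gender_indicator(gender):
    """Indicator variable: 0 for 'Male', 1 for 'Female' (assumed coding)."""
    codes = {"Male": 0, "Female": 1}
    if gender not in codes:
        raise ValueError(f"unexpected level: {gender!r}")
    return codes[gender]


# Gender column for the nine tabulated athletes
genders = ["Male", "Male", "Female", "Male", "Female",
           "Female", "Male", "Male", "Female"]

x_gender = [gender_indicator(g) for g in genders]
print(x_gender)  # [0, 0, 1, 0, 1, 1, 0, 0, 1]
```

With this coding, the coefficient on the indicator estimates the difference in the mean response between the level coded 1 and the level coded 0, with the other regressors held fixed.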
Section Takeaways
• Regression models allow us to model the relationship between variables
• Regression models can be used to evaluate the variation between variables, and they also work well as prediction models