Date post: | 10-Apr-2018 |
Upload: | sanjeev-nawani |
8/8/2019 (2) Regression
1/34
Graphical and Numerical Description of Interval/Ratio Data: Scatter Plot and Linear Regression.
DR. TAREK TAWFIK
9/15/2010 Dr Tarek Amin 2
Rationale
The data of the two variables under investigation have been collected at the interval/ratio level, which usually means a large number of values, so cross tabulations are not a convenient means of describing the distribution. The equivalent technique is the use of a scatter plot.
Scatter plot
A scatter plot displays the joint distribution for two continuous variables.
Coordinates on a scatter plot indicate the value each case takes for each of the two variables.
The relation between unemployment and crime rate
City (no. = 5) | Unemployment rate (X) | Crime rate (Y)
A | 25 | 17
B | 13 | 15
C | 5 | 10
D | 10 | 5
E | 2 | 4
Unemployment rate (X) is the independent variable; crime rate (Y) is the dependent variable.
Scatter plot
What can this graph tell us?
[Figure: Relation between unemployment and crime rate. Horizontal axis: unemployment rate (%). Vertical axis: crime rate. Points a to e mark the five cities, with a linear trend line fitted through them.]
Linear Regression
Regression analysis is simply the task of fitting a line through a scatter plot of cases that best fits the data.
Any line can be expressed in a mathematical formula:
Y = a ± bX
Where:
Y is the dependent variable.
X is the independent variable.
a is the Y-intercept (the value of Y when X is zero).
b is the slope of the regression line.
The sign (±) indicates a positive or negative association.
The line formula
The previous formula elaborates that a line is defined by two factors:
The starting point along the vertical axis, a.
The slope of the line from this point, b.
The value of b indicates the direction of the slope: whether a positive, a negative, or no association exists between the two variables.
Scatter plots and Correlations
Three lines exhibiting positive, negative and no correlation.
Identifying the regression line
To identify the line that best fits the scatter plot we combine two characteristics:
The point of origin along the Y-axis. This alone is not enough to distinguish the line from the multitude of lines that can start from the same point.
The slope of the line. This alone does not distinguish the line from all the others that could occupy the space.
Criteria to identify the best fit line
If we specify both the point of origin on the Y-axis and the slope of the line from that point, we are able to identify uniquely any line within the space (the line of best fit).
Straight lines with the same value for a but different values for b
Straight lines with the same value for b but different values for a
Unemployment and crime rate (Y = 5 + 0.6X)?
The value for a (5) is the point on the Y-axis where the line begins; this is the number of crimes we expect to find in a city with an unemployment rate of zero.
The + sign means that the line has a positive slope, which indicates a positive correlation between these two variables.
The value of 0.6 for b is the slope, or coefficient, of the regression line: by how much crime will increase if unemployment increases by 1%.
The slope of any line, b = rise/run
[Figure: Relation between unemployment and crime rate. Horizontal axis: unemployment rate (%), 0 to 30. Vertical axis: crime rate, 0 to 20. The line Y = 5 + 0.6X is drawn with rise = 3 and run = 5, so b = 3/5 = 0.6.]
THE RESIDUAL ERROR
Is the difference between the observed value of the dependent variable (crime rate) and the value of the dependent variable predicted by a regression line.
No straight line will pass through all the points in a scatter plot; in fact, a good line might not touch any of the points: there will be a gap between each point and the regression line. Unless a point falls exactly on the line there will be a residual value.
Residual error
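[Figure: the line Y = 5 + 0.6X, with crime rate on the vertical axis and unemployment rate (%) on the horizontal axis. For City D, at an unemployment rate of 10%, expected = 11, actual = 5, error = -6.]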
For City D, with an unemployment rate of 10%, the line predicts that the number of crimes will be
Y = 5 + 0.6X = 5 + 0.6(10) = 11
Instead there were only 5 crimes, so the error is
e = Y_actual - Y_expected = 5 - 11 = -6
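The same calculation can be repeated for every city. The sketch below is illustrative only: it takes the five (X, Y) pairs from the earlier table and the line Y = 5 + 0.6X; the function and variable names are assumptions, not part of the slides.

```python
# Unemployment rate (X) and crime rate (Y) for cities A-E, from the slides.
cities = {"A": (25, 17), "B": (13, 15), "C": (5, 10), "D": (10, 5), "E": (2, 4)}

def predict(x, a=5.0, b=0.6):
    """Value of Y expected from the line Y = a + bX."""
    return a + b * x

def residual(x, y_actual, a=5.0, b=0.6):
    """Residual error: e = Y_actual - Y_expected."""
    return y_actual - predict(x, a, b)

residuals = {city: residual(x, y) for city, (x, y) in cities.items()}
for city, e in residuals.items():
    print(f"City {city}: e = {e:+.1f}")
```

City D reproduces the slide's error of -6; the other cities show that none of the five points falls exactly on this line.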
Ordinary least squares regression (OLS).
The best line is the one that makes the residuals as small as possible (minimizing residuals).
Ordinary least squares regression is a rule that tells us to draw the line through a scatter plot that minimizes the sum of the squared residuals.
The OLS regression line must pass through the point whose coordinates are the averages of the independent and dependent variables, (X̄, Ȳ).
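Both properties can be checked numerically on the five-city data. The sketch below compares the sum of squared residuals of the earlier line Y = 5 + 0.6X with that of the least-squares line, computed here with the standard textbook estimates for slope and intercept. The helper name sum_squared_residuals is an assumption made for illustration.

```python
# City data from the slides: unemployment rate (X) and crime rate (Y).
x = [25, 13, 5, 10, 2]
y = [17, 15, 10, 5, 4]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

def sum_squared_residuals(a, b):
    """Sum of squared gaps between each observed Y and the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Standard least-squares estimates of slope and intercept.
b_ols = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
a_ols = mean_y - b_ols * mean_x

ssr_guess = sum_squared_residuals(5.0, 0.6)    # the earlier line Y = 5 + 0.6X
ssr_ols = sum_squared_residuals(a_ols, b_ols)  # the least-squares line

print(f"Y = 5 + 0.6X: sum of squared residuals = {ssr_guess:.2f}")
print(f"OLS line:     sum of squared residuals = {ssr_ols:.2f}")

# The OLS line passes through the point of means (mean_x, mean_y).
passes_through_means = abs((a_ols + b_ols * mean_x) - mean_y) < 1e-9
print("Passes through the means:", passes_through_means)
```

The least-squares line gives the smaller sum of squared residuals, which is the sense in which it is the best fit.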
The slope of the regression line b formula.
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
Easier formula
$$b = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}$$
Calculations
City | Unemployment rate X | Crime rate Y | Xi² | Yi² | XiYi
A | 25 | 17 | 625 | 289 | 425
B | 13 | 15 | 169 | 225 | 195
C | 5 | 10 | 25 | 100 | 50
D | 10 | 5 | 100 | 25 | 50
E | 2 | 4 | 4 | 16 | 8
Sum | ΣX = 55 | ΣY = 51 | ΣX² = 923 | ΣY² = 655 | ΣXY = 728
Mean | 11 | 10.2 | | |
b = [5(728) - (55)(51)] / [5(923) - (55)²] = +0.53
This is called the regression coefficient.
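The column sums and the slope can be reproduced in a few lines. This is a minimal sketch of the computational formula using the table above; the variable names are chosen for illustration.

```python
# (city, unemployment rate X, crime rate Y) rows from the calculation table.
rows = [("A", 25, 17), ("B", 13, 15), ("C", 5, 10), ("D", 10, 5), ("E", 2, 4)]

n = len(rows)
sum_x = sum(x for _, x, _ in rows)        # 55
sum_y = sum(y for _, _, y in rows)        # 51
sum_xy = sum(x * y for _, x, y in rows)   # 728
sum_x2 = sum(x * x for _, x, _ in rows)   # 923
sum_y2 = sum(y * y for _, _, y in rows)   # 655

# b = (n * sum(XY) - sum(X) * sum(Y)) / (n * sum(X^2) - (sum(X))^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
print(f"b = {b:.2f}")  # prints b = 0.53
```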
The regression coefficient
Indicates by how many units the dependent variable will change, given a one-unit change in the independent variable.
An increase in the unemployment rate of 1% is correlated with an increase of 0.53 in the crime rate.
Prediction (determination of Y)
$$a = \bar{Y} - b\bar{X}$$
a = 10.2 - 0.53(11) = 4.4
The line of best fit: Y = 4.4 + 0.53X
Y = a ± bX
If we have another city with an unemployment rate of 18%, what is the best guess for the crime rate?
Y = 4.4 + 0.53(18) = 13.9
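As a rough check, the intercept and the prediction can be computed without intermediate rounding. The sketch below uses the deviation form of the slope formula on the five-city data; predict_crime_rate is a hypothetical helper name, not something defined in the slides.

```python
# City data from the slides: unemployment rate (X) and crime rate (Y).
x = [25, 13, 5, 10, 2]
y = [17, 15, 10, 5, 4]
n = len(x)
mean_x = sum(x) / n   # 11.0
mean_y = sum(y) / n   # 10.2

# Slope from deviations about the means, then intercept a = mean(Y) - b * mean(X).
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

def predict_crime_rate(unemployment_rate):
    """Best guess for Y from the fitted line Y = a + bX."""
    return a + b * unemployment_rate

prediction = predict_crime_rate(18)
print(f"a = {a:.1f}, b = {b:.2f}")
print(f"Predicted crime rate at 18% unemployment: {prediction:.1f}")
```

Carrying full precision gives the same rounded answers as the slide: a of about 4.4, b of about 0.53, and a predicted crime rate of about 13.9.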
Pearson's product moment correlation coefficient.
The value of b does not indicate the strength of the relationship, because units of measurement vary from one situation to another.
To overcome this we convert b into a standardized measure of correlation called the product moment correlation coefficient, Pearson's r, which ranges from -1 to +1 regardless of the units of measurement.
Formula and calculation of r
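$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left[\sum (X_i - \bar{X})^2\right]\left[\sum (Y_i - \bar{Y})^2\right]}}$$
r = 0.81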
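The value r = 0.81 can be reproduced from the five-city data. The sketch below also shows one way to read "converting b into a standardized measure": rescaling the slope by the spread of X relative to the spread of Y gives the same r. Variable names are illustrative.

```python
import math

# City data from the slides: unemployment rate (X) and crime rate (Y).
x = [25, 13, 5, 10, 2]
y = [17, 15, 10, 5, 4]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Pearson's r from sums of products and squares of deviations.
r = s_xy / math.sqrt(s_xx * s_yy)

# Equivalent route: rescale the slope b by the ratio of the two spreads.
b = s_xy / s_xx
r_from_b = b * math.sqrt(s_xx / s_yy)

print(f"r = {r:.2f}")  # prints r = 0.81
```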
The Coefficient of Determination
The predictive ability of the regression line will be affected by how tightly the scores are packed or dispersed around the line.
[Figure: two scatter plots, labelled a and b, each with a regression line.]
We can predict with greater confidence with a than with b. Therefore we need some measure of how much of the variation in the dependent variable is explained by a regression line: the coefficient of determination, r².
The coefficient of determination
It is a PRE (proportional reduction in error) measure of the amount of variation explained by a regression line, and therefore gives a sense of how much confidence we should place in the accuracy of our prediction.
r² = 0.65 indicates that the least squares regression line explains 65% of the variance of the dependent variable (crime rate), relative to the variance explained by the horizontal no-relation line.
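The PRE reading can be made concrete on the five-city data. In the sketch below, E1 is the squared error when every city is predicted by the horizontal line at the mean of Y, and E2 is the squared error around the least-squares line; the names e1, e2 and pre are illustrative.

```python
# City data from the slides: unemployment rate (X) and crime rate (Y).
x = [25, 13, 5, 10, 2]
y = [17, 15, 10, 5, 4]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# E1: errors around the horizontal no-relation line Y = mean(Y).
e1 = sum((yi - mean_y) ** 2 for yi in y)
# E2: errors around the regression line Y = a + bX.
e2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Proportional reduction in error, which equals r squared.
pre = (e1 - e2) / e1
print(f"r^2 = {pre:.2f}")  # prints r^2 = 0.65
```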
Multiple Regression
A real estate agent wants to explore the factors affecting the selling price of a house. The agent collects data on two variables, house size and selling price, for 12 houses.
There is a relationship between the selling price and the house size; does this hold true for the 12 houses?
House size (squares) | Selling price ($,000)
20 | 260
15 | 240
20 | 245
13 | 210
18 | 230
14 | 242
28 | 295
16 | 235
24 | 287
20 | 252
23 | 270
25 | 275
Continued
Conducting a simple regression analysis using the method of OLS produces the following results:
Y = 157 + 4.88X
r = 0.92
r² = 0.85
There is a positive relationship between house size and selling price.
For every one-square increase in house size, the selling price increases by $4,880.
The relationship is strong and highly reliable for making predictions.
The variation in house size does not perfectly predict selling price: the coefficient of determination is high (0.85), but not equal to 1. Therefore other factors also affect the sale price of houses in this sample.
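These results can be reproduced from the 12 houses listed earlier. The following is a minimal sketch using the same least-squares formulas as for the city data; the list names size and price are chosen here for illustration.

```python
import math

# House size (squares) and selling price ($,000) for the 12 houses in the slides.
size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]
price = [260, 240, 245, 210, 230, 242, 295, 235, 287, 252, 270, 275]
n = len(size)
mean_size = sum(size) / n
mean_price = sum(price) / n

s_xy = sum((x - mean_size) * (y - mean_price) for x, y in zip(size, price))
s_xx = sum((x - mean_size) ** 2 for x in size)
s_yy = sum((y - mean_price) ** 2 for y in price)

b = s_xy / s_xx                    # slope, in $,000 per square
a = mean_price - b * mean_size     # intercept, in $,000
r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson's r

print(f"Y = {a:.0f} + {b:.2f}X")              # prints Y = 157 + 4.88X
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")     # prints r = 0.92, r^2 = 0.85
```

Because price is recorded in thousands of dollars, a slope of about 4.88 corresponds to roughly $4,880 per additional square.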
Scatter plot
Not all the data points lie right on the regression line.
The actual selling price = a+b (house size) +e (error term)
Why multiple regression?
- We have three houses in the sample with equal sizes but different selling prices. Why? It may be due to random factors, or to other factors that regularly impact on the prices of the houses, such as the age of the house.
- There may be a negative relation between the age of the house and the selling price; to investigate we use multiple regression.
- Multiple regression investigates the relationship between two or more independent variables and a single dependent variable.
Multi-collinearity
Multiple regression assumes that the independent variables are independent of each other.
[Diagram: House size, Age, Price]
Multiple regression
Age in years | House size (squares) | Selling price ($,000)
5 | 20 | 260
12 | 15 | 240
9 | 20 | 245
15 | 13 | 210
9 | 18 | 230
7 | 14 | 242
1 | 28 | 295
12 | 16 | 235
2 | 24 | 287
5 | 20 | 252
5 | 23 | 270
5 | 25 | 275
Selling price Y = a + b1 (house size) + b2 (age) + e
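One way to estimate a, b1 and b2 for this model is to solve the least-squares normal equations for two predictors directly. The slides do not say which software or method was used, so the sketch below is an assumption-laden illustration in plain Python; the helper cross_dev and all variable names are hypothetical.

```python
# Data for the 12 houses from the table above.
age = [5, 12, 9, 15, 9, 7, 1, 12, 2, 5, 5, 5]
size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]
price = [260, 240, 245, 210, 230, 242, 295, 235, 287, 252, 270, 275]

def mean(values):
    return sum(values) / len(values)

def cross_dev(u, v):
    """Sum of products of deviations of u and v about their means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

s11 = cross_dev(size, size)    # spread of house size
s22 = cross_dev(age, age)      # spread of age
s12 = cross_dev(size, age)     # co-variation of the two predictors
s1y = cross_dev(size, price)
s2y = cross_dev(age, price)

# Closed-form solution of the two-predictor normal equations.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det   # partial slope for house size
b2 = (s11 * s2y - s12 * s1y) / det   # partial slope for age
a = mean(price) - b1 * mean(size) - b2 * mean(age)

# e: what is left of each price after the fitted part is removed.
residuals = [y - (a + b1 * x1 + b2 * x2)
             for x1, x2, y in zip(size, age, price)]

print(f"Y = {a:.1f} + {b1:.2f}(house size) + {b2:.2f}(age)")
```

On these data the coefficient for age comes out negative, in line with the suspicion that older houses sell for less once size is taken into account.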
Multiple regression
To weigh the influence of each independent variable on the dependent variable, we calculate the regression coefficient and partial correlation for each independent variable on the dependent variable.
Interpretation of multiple regression.
Regression coefficient: Allows us to make predictions for the dependent variable based on the values of the independent variables, in terms of the original units of measurement.
Standardized coefficient: Allows us to distinguish the relative importance of each independent variable in determining the value of the dependent variable.
R: Indicates the strength of the relationship.
R-squared: Indicates the amount of variation in the dependent variable explained by the combination of independent variables, and whether it is a good predictor of the dependent variable.
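For the house example, the standardized coefficients, R and R-squared can be sketched from the three pairwise correlations, using the standard two-predictor formulas. This is an illustration under that assumption, not the slides' own calculation; pearson_r and the beta_* names are hypothetical.

```python
import math

# Data for the 12 houses from the slides.
age = [5, 12, 9, 15, 9, 7, 1, 12, 2, 5, 5, 5]
size = [20, 15, 20, 13, 18, 14, 28, 16, 24, 20, 23, 25]
price = [260, 240, 245, 210, 230, 242, 295, 235, 287, 252, 270, 275]

def pearson_r(u, v):
    """Pearson's product moment correlation between two lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    s_uu = sum((ui - mu) ** 2 for ui in u)
    s_vv = sum((vi - mv) ** 2 for vi in v)
    return s_uv / math.sqrt(s_uu * s_vv)

r_size_price = pearson_r(size, price)
r_age_price = pearson_r(age, price)
r_size_age = pearson_r(size, age)   # correlation between the two predictors

# Standardized (beta) coefficients for a model with two predictors.
denom = 1 - r_size_age ** 2
beta_size = (r_size_price - r_size_age * r_age_price) / denom
beta_age = (r_age_price - r_size_age * r_size_price) / denom

# R-squared and multiple R for the combined model.
r_squared = beta_size * r_size_price + beta_age * r_age_price
multiple_r = math.sqrt(r_squared)

print(f"beta(size) = {beta_size:.2f}, beta(age) = {beta_age:.2f}")
print(f"R = {multiple_r:.2f}, R-squared = {r_squared:.2f}")
print(f"correlation between size and age = {r_size_age:.2f}")
```

Because the betas are unit-free, their magnitudes can be compared directly. The last line is also a quick check on multi-collinearity: a strong correlation between the two predictors means their separate contributions are hard to tell apart.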
Stepwise Regression
- It allows us to determine which combination of possible independent variables best explains the dependent variable.
- It does this by adding in and taking out variables from the calculations according to whether each makes a statistically significant change to the value of R-squared.
Thank you