Faculty Research Workshop, February 19, 2014
Tom Lehman, Ph.D., Professor, Department of Economics, Indiana Wesleyan University
Slide 2
- Introduction to Correlation Analysis and OLS
- OLS Multiple Regression Analysis Basics
- OLS Hierarchical Multiple Regression: Definition, Purposes and Technique
- Examples of OLS Hierarchical Modeling
- Q&A
- References
Slide 3
Correlation analysis hypothesis testing is designed to investigate whether two or more variables are correlated, or co-vary together, and whether this covariance is statistically significant.

Examples:
- What is the relationship between housing costs and monthly rental prices in urban housing markets?
- What is the relationship between GDP growth and growth in capital investment expenditures in the macroeconomy?
- What is the relationship between education levels and the unemployment rate in an urban economy?
- Does advertising expenditure increase sales revenues? If we spend $100,000 on ad costs, what are predicted gross sales?
- What is the relationship between county levels of educational attainment and county household income?
Slide 4
Bivariate = two variables: exploring the relationship between a dependent variable (DV) and a single independent variable (IV).
- Ideally both variables should be interval or ratio level (continuous). Categorical (nominal or ordinal) variables do not work as well in regression analysis (exception: dummy variables in multiple regression).
- Dependent Variable (DV): the variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by another; the variable whose value is being predicted or estimated.
- Independent Variable (IV): the variable that is hypothesized to influence the behavior of the DV. The IV is sometimes referred to as a predictor variable; it may predict the behavior of the DV.

We utilize the values of an IV to predict or estimate the value of a DV in a regression equation: Y = a + bX (see the sketch below).
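To make the prediction step concrete (e.g., the advertising question on Slide 3), here is a minimal sketch in Python; the intercept and slope values are assumed purely for illustration, not estimated from any real data.

```python
# Minimal sketch: predicting a DV value from a bivariate regression
# line Y = a + bX. The intercept and slope below are illustrative
# assumptions, not estimates from real data.

a = 50_000.0   # assumed Y-intercept: baseline gross sales with zero ad spend
b = 4.2        # assumed slope: added gross sales per dollar of ad spend

def predict(x: float) -> float:
    """Return the predicted DV value (Y-hat) for a given IV value x."""
    return a + b * x

# Predicted gross sales if we spend $100,000 on advertising:
print(predict(100_000))  # 50000 + 4.2 * 100000 = 470000.0
```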
Slide 5
[Scatter plot: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]
Slide 6
[Scatter plot with best-fit line: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]

The best-fit line (a.k.a. the predicted regression line) assumes a linear relationship; it traces a path through the scatter plot that is, on average, equidistant from each data point. OLS regression minimizes the sum of the squared distances between the observed data points and the predicted regression line.
Slide 7
Pearson's r: a computed value between -1.00 and +1.00 that measures the strength of association between X (IV) and Y (DV).
- The closer the value of Pearson's r to -1.00 or +1.00, the stronger the association.
- A value of -1.00 is a perfect negative correlation; a value of +1.00 is a perfect positive correlation; a value of 0 indicates no linear correlation at all.
- A positive Pearson's r = a positive or direct relationship; a negative Pearson's r = a negative or inverse relationship.
Slide 8
[Scatter plot divided into four quadrants at the mean of the X variable and the mean of the Y variable.]

- An X value below the X mean correlates with a Y value below the Y mean in the lower-left quadrant (leads to a positive coefficient).
- An X value above the X mean correlates with a Y value above the Y mean in the upper-right quadrant (leads to a positive coefficient).
- Very few outliers fall in the opposing two quadrants.
Slide 9
[Scatter plot divided into four quadrants at the mean of the X variable and the mean of the Y variable.]

- An X value below the X mean correlates with a Y value above the Y mean in the upper-left quadrant (leads to a negative coefficient).
- An X value above the X mean correlates with a Y value below the Y mean in the lower-right quadrant (leads to a negative coefficient).
- Very few outliers fall in the opposing two quadrants (the quadrant logic is illustrated numerically in the sketch below).
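The quadrant logic on these two slides can be seen numerically: the sign of the correlation follows the sign of the summed cross-products of deviations from the two means. A minimal sketch with illustrative data:

```python
# Sketch of the quadrant logic: the sign of the correlation is driven
# by the sum of cross-products of deviations from the means,
# sum((x - x_mean) * (y - y_mean)). Data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # rises with x

dx = x - x.mean()   # negative below the X mean, positive above it
dy = y - y.mean()   # negative below the Y mean, positive above it

# Points in the lower-left and upper-right quadrants contribute
# positive products; points in the other two quadrants contribute
# negative products. The overall sign matches Pearson's r.
print(np.sum(dx * dy))          # positive here -> direct relationship
print(np.corrcoef(x, y)[0, 1])  # Pearson's r, also positive
```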
Slide 10
The coefficient of determination is the squared value of Pearson's r, expressed as an absolute (positive) percentage. R² is a measure of the percent of variation in the DV explained (or accounted for) by the variation in the IV.

Example: if r = +0.849, then R² = 0.721. Interpretation: roughly 72.1% of the variation in the DV can be explained by the variation (changes) in the IV.
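A minimal sketch of the r-to-R² relationship, using numpy's corrcoef on illustrative data:

```python
# Sketch: Pearson's r and the coefficient of determination R^2 = r**2,
# computed on illustrative data with numpy.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # IV
y = np.array([3.0, 7.0, 8.0, 12.0, 15.0])  # DV

r = np.corrcoef(x, y)[0, 1]  # Pearson's r
r_squared = r ** 2           # share of DV variance explained by the IV

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
# Slide example: if r = +0.849, then R^2 = 0.721, i.e., about 72.1%
# of the variation in Y is accounted for by variation in X.
```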
Slide 11
The regression equation: Ŷ = a + bX
- Regression analysis and the regression equation are used to predict the best-fit regression line from the X-Y data.
- Simply hand-drawing a best-fit line through a scatter plot is subjective and unreliable; we need a precise statistical method to estimate the true best-fit regression line.
- Estimated Y value (Ŷ) = Y-intercept + slope × (given value of X)

Least-squares principle:
- The best-fit regression line is statistically estimated by minimizing the sum of the squared vertical differences between the actual Y values (Y) and the predicted Y values (Ŷ).
- Minimizing the distances between the best-fit line (Ŷ) and the actual values of Y: minimizing Σ(Y − Ŷ)².
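The least-squares estimates have a closed-form solution in the bivariate case. A sketch on illustrative data, cross-checked against numpy's own fit:

```python
# Sketch of the least-squares principle: the slope and intercept that
# minimize sum((Y - Y_hat)**2) have the closed-form solution below.
# Data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.0, 9.5])

# b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()     # the fitted line passes through the means

y_hat = a + b * x
ssr = np.sum((y - y_hat) ** 2)  # sum of squared residuals being minimized

print(a, b, ssr)
print(np.polyfit(x, y, 1))      # numpy's fit returns the same [slope, intercept]
```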
Slide 12
Multiple = more than two variables (more precise, more thorough than simple bivariate regression analysis): exploring the relationship between a dependent variable (DV) and two or more independent variables (IVs).
- Variables must be interval or ratio level (continuous).
- Dependent Variable (DV): the variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by others; the variable whose value is being predicted or estimated.
- Independent Variables (IVs): the variables that are, together, hypothesized to influence the behavior of the DV. The IVs are sometimes referred to as predictor variables; together, they may predict the behavior of the DV.

We utilize the values of the IVs to predict or estimate the value of a DV in a regression equation: Ŷ = a + b₁X₁ + b₂X₂ + … + bₙXₙ (see the sketch below).
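A sketch of a multiple regression fit in Python with statsmodels (the presentation itself uses SPSS; this is just an equivalent illustration on simulated data):

```python
# Sketch of a multiple regression fit with statsmodels. Data and
# coefficients are simulated for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                # first IV
x2 = rng.normal(size=n)                # second IV
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)  # DV

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept a
model = sm.OLS(y, X).fit()

# Fitted coefficients correspond to a, b1, b2 in Y = a + b1*X1 + b2*X2
print(model.params)
print(model.rsquared)
```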
Slide 13
Multiple regression analysis allows us to investigate the relationship or correlation between several IVs and a continuous DV while controlling for the effects of all the other IVs in the regression equation. In other words, we can observe the impact of a single IV on a DV while controlling for the effects of several other IVs simultaneously. Multiple regression allows us to hold constant the other IVs in the equation so that we can analyze the impact of each IV on the DV net of the disturbances of other factors (see Grimm and Yarnold, 1995; Gujarati, 1995; Kennedy, 2008; Tabachnick and Fidell, 2012).
Slide 14
- For each value of X (IV) there is a group of Y (DV) values, and these values must be normally distributed; the means of these Y values lie on the predicted regression line.
- The DV must be a continuous variable (ratio or interval), not categorical.
- The relationship between the DV and all IVs must be linear, not curvilinear.
- The mean of the residuals (Y − Ŷ) must equal 0.
- The DV is statistically independent, with no autocorrelation with itself (i.e., the DV cannot be autocorrelated with successive observations of itself; one of the DV values cannot have influenced another of the DV values, as often occurs in time-series data).
- Homoscedasticity: the spread of the Y − Ŷ residuals must be equal over the entire range of the predicted regression line; it must be the same for all values of X and cannot be heteroscedastic (Kennedy, 2008).
- Multiple IVs included in the regression model cannot suffer from multicollinearity with each other.

Several of these assumptions can be checked with standard diagnostics, as in the sketch below.
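A sketch using statsmodels diagnostics on simulated data; the named tests (Breusch-Pagan, Durbin-Watson, VIF) are common choices for these checks, not ones prescribed by the slides:

```python
# Sketch of checks for several of the assumptions above (illustrative
# data; the thresholds in comments are rules of thumb, not hard cutoffs).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
X = sm.add_constant(rng.normal(size=(n, 3)))            # constant + 3 IVs
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()

print(res.resid.mean())                   # mean of residuals: should be ~0
print(het_breuschpagan(res.resid, X)[1])  # small p-value suggests heteroscedasticity
print(durbin_watson(res.resid))           # ~2 suggests no autocorrelation
# VIF per IV (excluding the constant); values above ~10 flag multicollinearity
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```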
Slide 15
[Scatter plot illustrating heteroscedasticity: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]

The error terms or residuals (Y − Ŷ) are not equal along the entire regression line: as the value of the IV increases, the Y − Ŷ residuals get larger and larger, and the data points fan out wider about the regression line.
Slide 16
Standard or Simultaneous Multiple Regression Technique:
- All IVs are entered into the model simultaneously; reveals only the unique effects of each IV on the DV.
- A single model is constructed with all IVs included at the same time.

Hierarchical or Sequential Multiple Regression Technique:
- Sets of IVs are entered into each regression model systematically, perhaps one by one.
- Allows the analyst to determine how much additional variance in the DV (R²) is explained by adding consecutive additional IVs in a systematic pattern.
- Multiple regression models are generated, with each successive model containing more IVs than the previous models (Grimm and Yarnold, 1995).
Slide 17
- The first regression estimation (model) contains one or more predictors.
- The next estimation (model) adds one or more new predictors to those used in the first analysis.
- The change in R² between consecutive models represents the proportion of variance in the DV shared exclusively with the newly entered variables.
- Caution: the partial coefficients in each consecutive model are not directly comparable to one another. The impact of the variables entered in earlier steps is partialed from correlations involving variables entered in later steps (Grimm and Yarnold, 1995).

A sketch of this sequence follows.
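On simulated data: fit nested models, read off the R² change, and test it with an incremental F-test (the F-test step is a standard companion to ΔR², not something the slide itself specifies):

```python
# Sketch of the hierarchical (sequential) technique: fit nested models,
# then read the change in R^2 contributed by the newly entered IV.
# Data are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

step1 = sm.OLS(y, sm.add_constant(x1)).fit()                        # model 1: x1 only
step2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit() # model 2: adds x2

delta_r2 = step2.rsquared - step1.rsquared   # variance shared uniquely with x2
f_stat, p_value, df_diff = step2.compare_f_test(step1)  # F-test for the R^2 change

print(f"R^2 change = {delta_r2:.3f}, F = {f_stat:.2f}, p = {p_value:.4f}")
```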
Slide 18
Standard or simultaneous multiple regression estimation on a set of IVs in SPSS (Sweet and Grace-Martin, 2011).

Variables:
- DV: county median household income, 2011
- IVs: geographic border to metro county; educational attainment measures; interstate highway density; population density; labor force participation rate
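For illustration only, a sketch of what this simultaneous estimation might look like in Python rather than SPSS; every column name below is a hypothetical stand-in for the variables listed, and the data are random numbers, not the presenter's county dataset:

```python
# Hedged sketch of a standard (simultaneous) estimation via the
# statsmodels formula API. Column names are hypothetical stand-ins;
# data are simulated, not real county figures.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 92  # e.g., one row per county (illustrative)
df = pd.DataFrame({
    "metro_border": rng.integers(0, 2, n),     # dummy: borders a metro county
    "educ_attainment": rng.normal(25, 5, n),   # e.g., % of adults with a BA
    "interstate_density": rng.normal(size=n),
    "pop_density": rng.normal(size=n),
    "lfp_rate": rng.normal(63, 4, n),
})
df["median_hh_income_2011"] = (
    30 + 5 * df["metro_border"] + 0.8 * df["educ_attainment"]
    + 0.3 * df["lfp_rate"] + rng.normal(scale=3, size=n)
)

# All IVs entered at the same time, as in the standard technique
model = smf.ols(
    "median_hh_income_2011 ~ metro_border + educ_attainment"
    " + interstate_density + pop_density + lfp_rate",
    data=df,
).fit()
print(model.summary())
```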
Slide 19
Slide 20
Hierarchical multiple regression estimation using the same IVs, entered consecutively or sequentially.

Variables:
- DV: county median household income, 2011
- IVs: geographic border to metro county; educational attainment measures; interstate highway density; population density; labor force participation rate
Slide 21
Slide 22
Slide 23
Hierarchical multiple regression estimation using a different model and variables.

Variables:
- DV: county per capita income, 2011
- IVs: county workforce composition (farming and professional); water amenities (square miles of county water area); taxation (aggregate real estate taxes per capita); immigration (county share of foreign-born population)
Slide 24
Slide 25
Slide 26
Slide 27
Berry, W.D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage Publications.
Grimm, L.G. and Yarnold, P.R. (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.
Gujarati, D.N. (1995). Basic Econometrics, 3rd edition. New York, NY: McGraw-Hill/Irwin.
Kennedy, P. (2008). A Guide to Econometrics, 6th edition. New York, NY: Wiley-Blackwell.
Lind, D.A., Marchal, W.G. and Wathen, S. (2011). Statistical Techniques in Business and Economics, 15th edition. Boston, MA: McGraw-Hill/Irwin.
Sweet, S.A. and Grace-Martin, K. (2011). Data Analysis With SPSS: A First Course in Applied Statistics, 3rd edition. Boston, MA: Allyn and Bacon.
Tabachnick, B.G. and Fidell, L.S. (2012). Using Multivariate Statistics, 6th edition. Boston, MA: Allyn and Bacon.