Faculty Research Workshop, February 19, 2014
Tom Lehman, Ph.D., Professor, Department of Economics, Indiana Wesleyan University
Slide 2
- Introduction to Correlation Analysis and OLS
- OLS Multiple Regression Analysis Basics
- OLS Hierarchical Multiple Regression: Definition, Purposes and Technique
- Examples of OLS Hierarchical Modeling
- Q&A
- References
Slide 3
Correlation analysis hypothesis testing is designed to investigate whether two or more variables are correlated, or co-vary together, and whether this covariance is statistically significant.

Examples:
- What is the relationship between housing costs and monthly rental prices in urban housing markets?
- What is the relationship between GDP growth and growth in capital investment expenditures in the macroeconomy?
- What is the relationship between education levels and the unemployment rate in an urban economy?
- Does advertising expenditure increase sales revenues? If we spend $100,000 on ad costs, what are predicted gross sales?
- What is the relationship between county levels of educational attainment and county household income?
Slide 4
Bivariate = two variables: exploring the relationship between a dependent variable (DV) and a single independent variable (IV).
- Ideally both variables should be interval or ratio level (continuous). Categorical (nominal or ordinal) variables do not work as well in regression analysis (exception: dummy variables in multiple regression).
- Dependent Variable (DV): the variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by another; the variable whose value is being predicted or estimated.
- Independent Variable (IV): the variable that is hypothesized to influence the behavior of the DV. The IV is sometimes referred to as a predictor variable; it may predict the behavior of the DV.

We utilize the values of an IV to predict or estimate the value of a DV in a regression equation: Y = a + bX (see the sketch below).
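To make the prediction step concrete (e.g., the advertising question on Slide 3), here is a minimal sketch in Python; the intercept and slope values are assumed purely for illustration, not estimated from any real data.

```python
# Minimal sketch: predicting a DV value from a bivariate regression
# line Y = a + bX. The intercept and slope below are illustrative
# assumptions, not estimates from real data.

a = 50_000.0   # assumed Y-intercept: baseline gross sales with zero ad spend
b = 4.2        # assumed slope: added gross sales per dollar of ad spend

def predict(x: float) -> float:
    """Return the predicted DV value (Y-hat) for a given IV value x."""
    return a + b * x

# Predicted gross sales if we spend $100,000 on advertising:
print(predict(100_000))  # 50000 + 4.2 * 100000 = 470000.0
```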
Slide 5
[Scatter plot: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]
Slide 6
[Scatter plot with best-fit line: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]

The best-fit line (a.k.a. the predicted regression line) assumes a linear relationship; it traces a path through the scatter plot that is, on average, equidistant from each data point. OLS regression minimizes the sum of the squared distances between the observed data points and the predicted regression line.
Slide 7
Pearson's r: a computed value between -1.00 and +1.00 that measures the strength of association between X (IV) and Y (DV).
- The closer the value of Pearson's r to -1.00 or +1.00, the stronger the association.
- A value of -1.00 is a perfect negative correlation; a value of +1.00 is a perfect positive correlation; a value of 0 indicates no linear correlation at all.
- A positive Pearson's r = a positive or direct relationship; a negative Pearson's r = a negative or inverse relationship.
Slide 8
[Scatter plot divided into four quadrants at the mean of the X variable and the mean of the Y variable.]

- An X value below the X mean correlates with a Y value below the Y mean in the lower-left quadrant (leads to a positive coefficient).
- An X value above the X mean correlates with a Y value above the Y mean in the upper-right quadrant (leads to a positive coefficient).
- Very few outliers fall in the opposing two quadrants.
Slide 9
[Scatter plot divided into four quadrants at the mean of the X variable and the mean of the Y variable.]

- An X value below the X mean correlates with a Y value above the Y mean in the upper-left quadrant (leads to a negative coefficient).
- An X value above the X mean correlates with a Y value below the Y mean in the lower-right quadrant (leads to a negative coefficient).
- Very few outliers fall in the opposing two quadrants (the quadrant logic is illustrated numerically in the sketch below).
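The quadrant logic on these two slides can be seen numerically: the sign of the correlation follows the sign of the summed cross-products of deviations from the two means. A minimal sketch with illustrative data:

```python
# Sketch of the quadrant logic: the sign of the correlation is driven
# by the sum of cross-products of deviations from the means,
# sum((x - x_mean) * (y - y_mean)). Data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # rises with x

dx = x - x.mean()   # negative below the X mean, positive above it
dy = y - y.mean()   # negative below the Y mean, positive above it

# Points in the lower-left and upper-right quadrants contribute
# positive products; points in the other two quadrants contribute
# negative products. The overall sign matches Pearson's r.
print(np.sum(dx * dy))          # positive here -> direct relationship
print(np.corrcoef(x, y)[0, 1])  # Pearson's r, also positive
```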
Slide 10
The coefficient of determination is the squared value of Pearson's r, expressed as an absolute (positive) percentage. R² is a measure of the percent of variation in the DV explained (or accounted for) by the variation in the IV.

Example: if r = +0.849, then R² = 0.721. Interpretation: roughly 72.1% of the variation in the DV can be explained by the variation (changes) in the IV.
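A minimal sketch of the r-to-R² relationship, using numpy's corrcoef on illustrative data:

```python
# Sketch: Pearson's r and the coefficient of determination R^2 = r**2,
# computed on illustrative data with numpy.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # IV
y = np.array([3.0, 7.0, 8.0, 12.0, 15.0])  # DV

r = np.corrcoef(x, y)[0, 1]  # Pearson's r
r_squared = r ** 2           # share of DV variance explained by the IV

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
# Slide example: if r = +0.849, then R^2 = 0.721, i.e., about 72.1%
# of the variation in Y is accounted for by variation in X.
```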
Slide 11
The regression equation: Ŷ = a + bX
- Regression analysis and the regression equation are used to predict the best-fit regression line from the X-Y data.
- Simply hand-drawing a best-fit line through a scatter plot is subjective and unreliable; we need a precise statistical method to estimate the true best-fit regression line.
- Estimated Y value (Ŷ) = Y-intercept + slope × (given value of X)

Least-squares principle:
- The best-fit regression line is statistically estimated by minimizing the sum of the squared vertical differences between the actual Y values (Y) and the predicted Y values (Ŷ).
- Minimizing the distances between the best-fit line (Ŷ) and the actual values of Y: minimizing Σ(Y − Ŷ)².
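The least-squares estimates have a closed-form solution in the bivariate case. A sketch on illustrative data, cross-checked against numpy's own fit:

```python
# Sketch of the least-squares principle: the slope and intercept that
# minimize sum((Y - Y_hat)**2) have the closed-form solution below.
# Data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.0, 9.5])

# b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()     # the fitted line passes through the means

y_hat = a + b * x
ssr = np.sum((y - y_hat) ** 2)  # sum of squared residuals being minimized

print(a, b, ssr)
print(np.polyfit(x, y, 1))      # numpy's fit returns the same [slope, intercept]
```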
Slide 12
Multiple = more than two variables (more precise, more thorough than simple bivariate regression analysis): exploring the relationship between a dependent variable (DV) and two or more independent variables (IVs).
- Variables must be interval or ratio level (continuous).
- Dependent Variable (DV): the variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by others; the variable whose value is being predicted or estimated.
- Independent Variables (IVs): the variables that are, together, hypothesized to influence the behavior of the DV. The IVs are sometimes referred to as predictor variables; together, they may predict the behavior of the DV.

We utilize the values of the IVs to predict or estimate the value of a DV in a regression equation: Ŷ = a + b₁X₁ + b₂X₂ + … + bₙXₙ (see the sketch below).
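A sketch of a multiple regression fit in Python with statsmodels (the presentation itself uses SPSS; this is just an equivalent illustration on simulated data):

```python
# Sketch of a multiple regression fit with statsmodels. Data and
# coefficients are simulated for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                # first IV
x2 = rng.normal(size=n)                # second IV
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)  # DV

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept a
model = sm.OLS(y, X).fit()

# Fitted coefficients correspond to a, b1, b2 in Y = a + b1*X1 + b2*X2
print(model.params)
print(model.rsquared)
```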
Slide 13
Multiple regression analysis allows us to investigate the relationship or correlation between several IVs and a continuous DV while controlling for the effects of all the other IVs in the regression equation. In other words, we can observe the impact of a single IV on a DV while controlling for the effects of several other IVs simultaneously. Multiple regression allows us to hold constant the other IVs in the equation so that we can analyze the impact of each IV on the DV net of the disturbances of other factors (see Grimm and Yarnold, 1995; Gujarati, 1995; Kennedy, 2008; Tabachnick and Fidell, 2012).
Slide 14
- For each value of X (IV) there is a group of Y (DV) values, and these values must be normally distributed; the means of these Y values lie on the predicted regression line.
- The DV must be a continuous variable (ratio or interval), not categorical.
- The relationship between the DV and all IVs must be linear, not curvilinear.
- The mean of the residuals (Y − Ŷ) must equal 0.
- The DV is statistically independent, with no autocorrelation with itself (i.e., the DV cannot be autocorrelated with successive observations of itself; one of the DV values cannot have influenced another of the DV values, as often occurs in time-series data).
- Homoscedasticity: the spread of the Y − Ŷ residuals must be equal over the entire range of the predicted regression line; it must be the same for all values of X and cannot be heteroscedastic (Kennedy, 2008).
- Multiple IVs included in the regression model cannot suffer from multicollinearity with each other.

Several of these assumptions can be checked with standard diagnostics, as in the sketch below.
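A sketch using statsmodels diagnostics on simulated data; the named tests (Breusch-Pagan, Durbin-Watson, VIF) are common choices for these checks, not ones prescribed by the slides:

```python
# Sketch of checks for several of the assumptions above (illustrative
# data; the thresholds in comments are rules of thumb, not hard cutoffs).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
X = sm.add_constant(rng.normal(size=(n, 3)))            # constant + 3 IVs
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()

print(res.resid.mean())                   # mean of residuals: should be ~0
print(het_breuschpagan(res.resid, X)[1])  # small p-value suggests heteroscedasticity
print(durbin_watson(res.resid))           # ~2 suggests no autocorrelation
# VIF per IV (excluding the constant); values above ~10 flag multicollinearity
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```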
Slide 15
[Scatter plot illustrating heteroscedasticity: values of the IV (X) on the horizontal axis, values of the DV (Y) on the vertical axis.]

The error terms or residuals (Y − Ŷ) are not equal along the entire regression line: as the value of the IV increases, the Y − Ŷ residuals get larger and larger, and the data points fan out wider about the regression line.
Slide 16
Standard or Simultaneous Multiple Regression Technique:
- All IVs are entered into the model simultaneously; reveals only the unique effects of each IV on the DV.
- A single model is constructed with all IVs included at the same time.

Hierarchical or Sequential Multiple Regression Technique:
- Sets of IVs are entered into each regression model systematically, perhaps one by one.
- Allows the analyst to determine how much additional variance in the DV (R²) is explained by adding consecutive additional IVs in a systematic pattern.
- Multiple regression models are generated, with each successive model containing more IVs than the previous models (Grimm and Yarnold, 1995).
Slide 17
- The first regression estimation (model) contains one or more predictors.
- The next estimation (model) adds one or more new predictors to those used in the first analysis.
- The change in R² between consecutive models represents the proportion of variance in the DV shared exclusively with the newly entered variables.
- Caution: the partial coefficients in each consecutive model are not directly comparable to one another. The impact of the variables entered in earlier steps is partialed from correlations involving variables entered in later steps (Grimm and Yarnold, 1995).

A sketch of this sequence follows.
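On simulated data: fit nested models, read off the R² change, and test it with an incremental F-test (the F-test step is a standard companion to ΔR², not something the slide itself specifies):

```python
# Sketch of the hierarchical (sequential) technique: fit nested models,
# then read the change in R^2 contributed by the newly entered IV.
# Data are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

step1 = sm.OLS(y, sm.add_constant(x1)).fit()                        # model 1: x1 only
step2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit() # model 2: adds x2

delta_r2 = step2.rsquared - step1.rsquared   # variance shared uniquely with x2
f_stat, p_value, df_diff = step2.compare_f_test(step1)  # F-test for the R^2 change

print(f"R^2 change = {delta_r2:.3f}, F = {f_stat:.2f}, p = {p_value:.4f}")
```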
Slide 18
Standard or simultaneous multiple regression estimation on a set of IVs in SPSS (Sweet and Grace-Martin, 2011).

Variables:
- DV: county median household income, 2011
- IVs: geographic border to metro county; educational attainment measures; interstate highway density; population density; labor force participation rate
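For illustration only, a sketch of what this simultaneous estimation might look like in Python rather than SPSS; every column name below is a hypothetical stand-in for the variables listed, and the data are random numbers, not the presenter's county dataset:

```python
# Hedged sketch of a standard (simultaneous) estimation via the
# statsmodels formula API. Column names are hypothetical stand-ins;
# data are simulated, not real county figures.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 92  # e.g., one row per county (illustrative)
df = pd.DataFrame({
    "metro_border": rng.integers(0, 2, n),     # dummy: borders a metro county
    "educ_attainment": rng.normal(25, 5, n),   # e.g., % of adults with a BA
    "interstate_density": rng.normal(size=n),
    "pop_density": rng.normal(size=n),
    "lfp_rate": rng.normal(63, 4, n),
})
df["median_hh_income_2011"] = (
    30 + 5 * df["metro_border"] + 0.8 * df["educ_attainment"]
    + 0.3 * df["lfp_rate"] + rng.normal(scale=3, size=n)
)

# All IVs entered at the same time, as in the standard technique
model = smf.ols(
    "median_hh_income_2011 ~ metro_border + educ_attainment"
    " + interstate_density + pop_density + lfp_rate",
    data=df,
).fit()
print(model.summary())
```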
Slide 19
Slide 20
Hierarchical multiple regression estimation using the same IVs, entered consecutively or sequentially.

Variables:
- DV: county median household income, 2011
- IVs: geographic border to metro county; educational attainment measures; interstate highway density; population density; labor force participation rate
Slide 21
Slide 22
Slide 23
Hierarchical multiple regression estimation using a different model and variables.

Variables:
- DV: county per capita income, 2011
- IVs: county workforce composition (farming and professional); water amenities (square miles of county water area); taxation (aggregate real estate taxes per capita); immigration (county share of foreign-born population)
Slide 24
Slide 25
Slide 26
Slide 27
Berry, W.D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage Publications.
Grimm, L.G. and Yarnold, P.R. (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.
Gujarati, D.N. (1995). Basic Econometrics, 3rd edition. New York, NY: McGraw-Hill/Irwin.
Kennedy, P. (2008). A Guide to Econometrics, 6th edition. New York, NY: Wiley-Blackwell.
Lind, D.A., Marchal, W.G. and Wathen, S. (2011). Statistical Techniques in Business and Economics, 15th edition. Boston, MA: McGraw-Hill/Irwin.
Sweet, S.A. and Grace-Martin, K. (2011). Data Analysis With SPSS: A First Course in Applied Statistics, 3rd edition. Boston, MA: Allyn and Bacon.
Tabachnick, B.G. and Fidell, L.S. (2012). Using Multivariate Statistics, 6th edition. Boston, MA: Allyn and Bacon.