
Dummy variable in MLR


Transcript

    1/48

    Today: Dummy variables.

    Dummy variables in a multiple regression, regression wrap up.


    2/48

    Looking back at regression, we've looked at how an interval-data
    response y changes as an interval-data explanatory variable x changes.

    Example: Number of books read (y) as a function of television

    watched (x).

    Y = a + bX + e


    3/48

    Last time, we expanded this idea to consider more than one

    explanatory / independent variable at the same time, where all

    the variables were interval data.

    This is called multiple regression.

    Example: Wins as a function of goals for and goals against.

    Y = a + b1X1 + b2X2 + e


    4/48

    This time, we're going to drop the requirement for the
    independent variables to be interval data.

    We're going to look at nominal data as independent variables.

    Recall: Nominal means name. It's data in categories with
    no natural order.

    Example: Type of Fruit --- Kumquat, Coconut, Tomato, Dragonfruit.


    5/48

    How do you put a type of fruit into a formula like this:

    Y = a + bX

    With a dummy variable.

    Dummy in this case just means a simple number variable (0 or 1) that we
    use in place of nominal, and sometimes ordinal, data.


    6/48

    We've already used dummy variables.

    Bearded dragon gender: 0 = Male, 1 = Female

    Bearded dragon colour: 0 = Green, 1 = Fancy

    Other possibilities:

    0 = Non-Smoker, 1 = Smoker

    0 = Domestic Student, 1 = International Student

    0 = Eastern, 1 = Western


    7/48

    Nominal data can have more than two categories, but we can't

    do this:

    Favourite colour:

    0 = Blue, 1 = Green, 2 = Red

    This would imply an order, and that having a favourite colour

    of green is somehow the middle ground between favouring

    blue and favouring red.*

    *If we cared about the wavelength of the favourite colour, perhaps, but usually not


    8/48

    Ordinal data can be made into a 0, 1, 2, ... scale, as long as we

    assume the differences between each category and the next

    one are about the same.

    0 = Against, 1 = Neutral, 2 = For

    Or

    -1 = Against, 0 = Neutral, 1 = For

    Then we're treating the ordinal data like interval data.
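    For interest, here is a rough sketch of that scoring step in Python rather
    than SPSS. The data and column names are made up for illustration, not
    taken from the lecture's files.

    ```python
    import pandas as pd

    # Hypothetical survey column: score the ordinal categories so they can be
    # treated as interval data, as described on this slide.
    survey = pd.DataFrame({"Opinion": ["Against", "For", "Neutral", "For"]})
    survey["OpinionScore"] = survey["Opinion"].map({"Against": 0, "Neutral": 1, "For": 2})
    print(survey)
    ```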

    Handling more than two categories is a for-interest topic, at

    the end of the lecture if time permits.


    9/48

    It's all just words until we get up and do something about it.


    10/48

    Dummy variables in regression:

    Consider the NHL data set. Let's see whether there is a difference in
    defensive skill between the Eastern and Western conferences, and by how
    much.

    Dependent variable: Goals against. (More goals against means

    weaker defence)

    Independent variable: Conference. (East or West)


    11/48

    In our data set, we have conference listed in two different

    ways. ConfName: E or W. Conf: 0 or 1.

    0 = Eastern Conference, 1 = Western Conference.

    ConfName is for when we need conference as nominal.

    Conf is our dummy variable for when we need interval data.
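    As a rough illustration outside the lecture's SPSS workflow, building Conf
    from ConfName could look like the Python sketch below. The rows are toy
    values; only the column names are taken from the slides.

    ```python
    import pandas as pd

    # Toy stand-in for the NHL data set; only the nominal conference column shown.
    nhl = pd.DataFrame({"ConfName": ["E", "W", "E", "W"]})

    # 0 = Eastern Conference, 1 = Western Conference, matching the slides.
    nhl["Conf"] = (nhl["ConfName"] == "W").astype(int)
    print(nhl)
    ```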


    12/48

    We can do a regression by using Conf as our independent variable.

    (SPSS won't even let you put ConfName in.)

    (Done under Analyze → Regression → Linear.)
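    For interest, a sketch of the same fit outside SPSS. The goals-against
    numbers below are made up, so the estimates won't match the slide output;
    only the variable roles mirror the lecture.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Toy data: goals against and the 0/1 conference dummy (0 = East, 1 = West).
    nhl = pd.DataFrame({"GoalsAgainst": [240, 226, 231, 214, 217, 222],
                        "Conf":         [0,   0,   0,   1,   1,   1]})

    fit = smf.ols("GoalsAgainst ~ Conf", data=nhl).fit()
    print(fit.params)     # Intercept = mean GA in the East; Conf = West minus East
    print(fit.rsquared)   # share of variance in goals against explained by conference
    ```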


    13/48

    We get this model summary.

    The conference alone explains .122 of the variance in goals

    against.

    There's a lot to goals against that isn't explained simply by

    whether you are in the Eastern or Western Conference.


    14/48

    We get these coefficients.

    The prediction formula is:

    (Goals against) = 232.867 - 17.333(Conference)

    The intercept is the response (Goals against) when the

    explanatory variable x = 0.

    Here, x=0 means Eastern Conference.


    15/48

    The intercept is the average Goals Against of teams in the

    Eastern Conference.

    The slope is the amount that (Goals Against) changes when

    (Conference) increases by 1.

    Changing x=0 to x=1 means switching from the Eastern to the

    Western Conference.


    16/48

    So the slope b is the difference in mean goals against between

    the conferences.

    Here, Western Conference teams let in 17.333 fewer goals.

    Plugging in x = 0 or 1:

    232.867 - 17.333(0) = 232.867 goals against if East
    232.867 - 17.333(1) = 215.534 goals against if West


    17/48

    Since there's only one independent variable, and it's nominal, we COULD
    do this with a two-tailed independent-samples t-test.

    (Analyze → Compare Means → Independent-Samples T Test)

    ConfName would be the grouping variable.
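    A sketch of that equivalent comparison outside SPSS. The goals-against
    values below are invented, so the output won't reproduce the slide's
    17.333 difference or its p-value of 0.059.

    ```python
    from scipy import stats

    # Toy goals-against values, split by conference the way ConfName would
    # split the real data set.
    east = [240, 226, 231, 245, 228]
    west = [214, 217, 222, 209, 220]

    # Two-tailed, pooled-variance t-test: gives the same p-value as the slope
    # of the regression on the 0/1 conference dummy.
    t, p = stats.ttest_ind(east, west)
    print(t, p)
    ```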


    18/48

    We would get the same results:

    A difference of 17.333 and a 2-tailed p-value of 0.059.

    So why do we bother with regression and dummy variables at

    all?


    19/48

    Greenland has the fastest moving glaciers in the world.


    20/48

    Multiple regression using a dummy variable.

    Let's go back to predicting wins.

    Before, we modelled wins using goals for (GF) and goals

    against (GA). Now we can consider conference alongside

    everything else.

    Your conference (East or West) is part of what determines the

    teams you play against. Teams that play against weak

    opponents tend to win more.


    21/48

    Will conference explain anything about wins that Goals For and

    Goals Against can't?

    In an SPSS multiple regression, we just include the dummy

    variable in the list of independents like everything else.
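    As a sketch of that step outside SPSS, the dummy simply joins the other
    predictors. The numbers are toy values; only the column names mirror the
    slides' variables.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Toy stand-in for the NHL data set.
    nhl = pd.DataFrame({
        "Wins": [48, 41, 35, 52, 44, 38, 47, 40],
        "GF":   [250, 232, 210, 266, 241, 219, 255, 228],
        "GA":   [215, 230, 248, 204, 226, 244, 211, 235],
        "Conf": [0, 0, 0, 0, 1, 1, 1, 1],   # 0 = East, 1 = West
    })

    fit = smf.ols("Wins ~ GF + GA + Conf", data=nhl).fit()
    print(fit.rsquared)   # compare with smf.ols("Wins ~ GF + GA", data=nhl).fit().rsquared
    print(fit.params)     # the Conf coefficient is the East/West gap after adjusting for goals
    ```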


    22/48

    First, the model summary.

    Considering goals for, goals against AND conference.

    82.9% of the variance in the number of wins can be explained

    by these three things together.


    23/48

    Going back to last day, considering only Goals For and Goals

    Against, we also got an R square of 0.829.

    In other words, adding conference into our model told us nothing about
    wins that the goals weren't already covering.


    24/48

    The R square of the model is the same with or without

    conference.

    That means just as much variance is explained by considering

    only goals for/against as by considering both goals for/against

    and the conference of the team.

    Conference contributes nothing extra.

    This is probably because the strength of your opponents is

    already reflected in the goals for / goals against record. It's not

    like goals against weak teams count for more.


    25/48

    The coefficient table for Wins as a function of Goals

    For/Against and Conference:

    The fact that conference isn't improving the model any is reflected in
    its significance.

    If its slope were really zero, we'd still see a sample like this .952 of
    the time. (p-value = .952)


    26/48

    The regression equation is:

    (Estimated Wins) =

    37.637 + 0.178(GF) - 0.167(GA) + 0.082(West Conf.)

    Meaning that being in the West meant winning 0.082 more games, holding
    goals for and against constant.
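    To see what the equation does, here is a quick check with made-up goal
    totals. The GF/GA values are hypothetical; the coefficients are the ones
    on this slide.

    ```python
    # Coefficients from the slide's regression equation.
    def estimated_wins(gf, ga, west):
        return 37.637 + 0.178 * gf - 0.167 * ga + 0.082 * west

    # Hypothetical team with 250 goals for and 220 against.
    print(estimated_wins(gf=250, ga=220, west=0))   # Eastern version of the team
    print(estimated_wins(gf=250, ga=220, west=1))   # Western version: 0.082 more wins
    ```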


    27/48

    But

    (Estimated Wins) =

    37.637 + 0.178(GF) - 0.167(GA) + 0.082(West Conf.)

    is more complicated than

    (Estimated Wins) =

    37.950 + 0.177(GF) - 0.163(GA)

    which is the model from last day that ignored conference.


    28/48

    But knowing the conference doesn't change anything.

    - The r² was .829 whether we included conference or not.

    - We failed to reject the null that the effect of conference was zero
      (controlling for Goals For/Against).

    In that case, we can use the simpler model that only uses goals

    and not lose anything. We should always opt for a simpler

    model when nothing is lost in doing so.

    This is called the principle of parsimony.


    29/48

    "Make everything as simple as possible, but not simpler."

    - Albert Einstein


    30/48

    Comments about r² in multiple regression.

    Like with single-variable regression, r² must be between 0 and 1.

    0 means none of the variance is explained.
    1 means all of it is explained.

    If you add more and more variables into your model, you will eventually
    reach r² = 1, where you have enough data to model and predict the
    response perfectly.


    31/48

    But each variable uses up a degree of freedom and makes the

    results harder to interpret.

    Just because you can include a variable doesn't mean you should.

    (Resting heart rate) = a + b1(Age) + b2(Body Mass Index) +
    b3(L of Oxygen per Minute) + b4(Height) + b5(Number of Freckles) +
    b6(Enjoyment of Sushi) + b7(Kitchen Sinks Owned)

    Again, this violates the principle of parsimony.


    32/48

    More regression practice.

    From dragons.sav, we have the weight of bearded dragons as a

    function of their age, length, and sex.

    What is the intercept?


    33/48

    Weight of beardies as a function of age, length, and sex.

    What is the intercept?

    -551.125

    What does it mean?


    34/48

    Weight of beardies as a function of age, length, and sex.

    What is the intercept?

    -551.125

    What does it mean?

    A male bearded dragon with 0 years of age and 0 length weighs negative
    551 grams. (Not real-world useful.)


    35/48

    How much heavier is a bearded dragon if it ages two years and doesn't
    get any longer or change sex? (On average)

    The slope for age is 17.191, so a dragon would get

    2 * 17.191 = 34.382 grams heavier with 2 extra years

    (controlling for length and sex)


    36/48

    Is there a significant difference in weight between male and female
    dragons of the same age and size?


    37/48

    Is there a significant difference in weight between male and female
    dragons of the same age and size?

    No. The p-value under the null of no difference is .441, so we fail to
    reject that null.


    38/48

    What does the regression equation look like?


    39/48

    What does the regression equation look like?

    (Estimated Weight) =
    -551.1 + 17.1(Age) + 34.3(Length) + 4.9(Female)


    40/48

    How much does the average bearded dragon weigh if he's...

    - Male
    - 3 Years Old
    - 24 cm long


    41/48

    How much does the average bearded dragon weigh if he's...

    - Male
    - 3 Years Old
    - 24 cm long

    (Estimated Weight) =
    -551.1 + 17.1(3) + 34.3(24) + 4.9(0)

    = 323.4 grams
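    A quick check of that arithmetic, using the rounded coefficients as they
    appear on the slide (Female = 0 for a male dragon):

    ```python
    # Rounded coefficients from the slide's regression equation.
    def estimated_weight(age_years, length_cm, female):
        return -551.1 + 17.1 * age_years + 34.3 * length_cm + 4.9 * female

    print(estimated_weight(age_years=3, length_cm=24, female=0))   # ≈ 323.4 grams
    ```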


    42/48

    Is there a model that likely works just as well but is simpler?


    43/48

    Yes. It's likely that a model without considering sex would explain
    nearly as much of the variance.

    From the model summaries:

    Model with Age, Length, Sex: r² = .912
    Model with Age, Length: r² = .912 (Not always so exact.)


    44/48

    For interest: Nominal data of 3+ categories.

    Dummy variables HAVE to be 0 or 1. If not, you're treating

    nominal categories as if they have some sort of order.

    If you have 3 categories, you need 2 dummy variables.


    45/48

    Each of the dummy variables is 1 only when a particular

    category comes up, and 0 all the other times.

    One of the categories is considered a baseline, or starting

    point. All of the dummy variables will be 0 for that category.

    (Here: Blue is the baseline, all the dummy variables are 0 for it)


    46/48

    Since a colour can't be red and green at the same time, only one of the
    dummy variables will ever be 1 for a particular case.

    Doing a linear model with just these two dummy variables would look like:

    (Estimated response) = a + b1(Red) + b2(Green)

    Which would be:

    = a for blue cases.
    = a + b1 for red cases.
    = a + b2 for green cases.
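    A sketch of building those two dummies outside SPSS, with toy rows and
    Blue kept as the baseline (both dummies are 0 for it):

    ```python
    import pandas as pd

    # Toy cases with a three-category nominal variable.
    cases = pd.DataFrame({"Colour": ["Blue", "Red", "Green", "Blue", "Green"]})

    # One 0/1 column per non-baseline category; Blue is dropped as the baseline.
    dummies = pd.get_dummies(cases["Colour"]).astype(int)[["Red", "Green"]]
    print(pd.concat([cases, dummies], axis=1))
    ```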


    47/48

    (Estimated response) = a + b1(Red) + b2(Green)

    a, the intercept, is the value when Red = 0 and Green = 0,
    i.e. the average response for blue cases.

    b1 is the average increase/decrease in the response when the case is
    red instead of blue.

    b2 is the average increase/decrease in the response when the case is
    green instead of blue.


    48/48

    Next time: Midterm 2 post-mortem.

    Reintroduction to contingency, Odds and Odds Ratios.

