STA 6236 Regression Analysis
Dr. Mark E. Johnson Fall 2014
1
STA 6236 Regression Analysis
Mark E. Johnson* Fall 2014
*PowerPoint slides modified from those developed by C. Nachtsheim, 2007 (adapted 2008, 2009; further modified 2010, 2011, 2013)
2
Aside from Roster Information
Survey: What have you had in statistics and what do you expect, hope for this semester?
3
A Little Matter of Recall
Second sheet to be filled out now, if you would be so kind
Take a quick look…
4
Intros please
5
At long last, the exciting syllabus
6
1
7
8
9
Correction (the first three 4222s should be 6236)
10
11
12
Addendum to the Syllabus
• Latest version kept on the web site for the course
13
How to Get JMP Pro 11
• MAP 150D or 150E (Math and Physics, Data Mining Lab)
• Free for enrollees in my class
• Can be loaded on Mac and non-Mac machines
• Bring laptop or PC to lab (must be done on site)
• Promised to be on all UCF systems this fall…
• Alternative software at your own discretion, risk and challenges (provided it does all the good things JMP Pro 11 does…good luck!)
14
Sites for downloading are set up
• Feel free to start reading the text
• Tutorials on JMP
• Data sets from JMP
• Data sets with text
15
Brief introduction to JMP Pro 11 (abbreviated for now, other than miscellaneous illustrations in week 1)
• Tutorial as part of the package
– Data files – ch1ta1
• Will use in class for all calculations
• You need it to produce output to bring to quizzes and for assignments to be submitted
• Play with the package, try it out on data, become proficient
• What does this part of the output tell me?
16
Let’s try one of JMP’s data sets
• Not my favorite in that x and y are generic (i.e., made up or disguised/proprietary)
• Just for illustration
17
August 20, 2014 (Class #2)
• Bubble plot (YouTube video okay on this)
• Intern position given on class web site
• JMP Pro 11 loadable on both Mac and non-Mac
• Word or two on the “stuff you gotta know”
• Regression and Big Data, terminology perspective
• “Regression”: thank you, Mr. Galton
• Regression to the mean
• Simple examples, illustration of normal dist.
18
SLOPE.jmp file
Not sure what x, y represent :(
File SLOPE.jmp from the sample data sets in JMP
Slope of fitted line looked “goofy” and potentially embarrassing
All unexpected results are learning opportunities
600 data points? Seriously? Seriously? …
Repeats or close-to-repeats
Finding the culprits…
19
Demo type stuff
Analyze distribution: mode of 62 (histogram not so clear)
Bubble plot
Stack x, y and jitter the points (jitter only for box plots in one-way)
20
Status of Stuff…
• JMP Pro 11. Site license in play?
• PowerPoint slides to be loaded this weekend.
21
Stuff you should know (eventually)
Normal density…an “easy” way to remember it
Gamma distribution (easiest skewed distribution to remember and includes χ²)
Let’s look at some plots…
Probability integral transform (well worth knowing)
Sample mean and sample variance (22; 10)
CLT…allows us to actually do something when our data is not normal
Skewness as in skewed; kurtosis not so well known
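A few of the items above can be tried out numerically. The sketch below is only an illustration: the function names (`normal_pdf`, `exponential_via_pit`) and the choice of an exponential target with rate 2 are assumptions made for this example, not part of the course materials. It evaluates the normal density, uses the probability integral transform in its inverse form (uniform draws pushed through an inverse CDF) to generate exponential draws, and then computes the sample mean and sample variance. The exponential is the gamma distribution with shape 1, so it doubles as a simple skewed example.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def exponential_via_pit(n, rate=1.0, seed=0):
    """Draw n exponential variates by pushing uniforms through the inverse CDF.

    If U is Uniform(0, 1), then -log(1 - U) / rate has CDF 1 - exp(-rate * x).
    """
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance (divisor n - 1)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

draws = exponential_via_pit(20000, rate=2.0)
print("normal density at 0:", normal_pdf(0.0))
print("sample mean:", sample_mean(draws))          # theory: 1 / rate = 0.5
print("sample variance:", sample_variance(draws))  # theory: 1 / rate^2 = 0.25
```

With a rate of 2 the theoretical mean is 0.5 and the theoretical variance is 0.25, so the two sample statistics should land close to those values.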
22
23
Density functions for distributions having mean 0, variance 1, skewness zero and kurtosis = 3. (Kurtosis does not equal peakedness.)
• Wikipedia has a lot of good stuff; knowing when the 5% or so that is goofy is actually goofy is a challenge.
24
Election Data per Gomez et al.
The Republicans Should Pray for Rain: Weather, Turnout, and Voting in U.S. Presidential Elections
Brad T. Gomez (University of Georgia), Thomas G. Hansford (University of California, Merced), George A. Krause (University of Pittsburgh)
The Journal of Politics, 2007
25
Basic Premises
Goal is to do the analysis RIGHT (i.e., sans time constraints or need to take short cuts)
(If we had 10,000 variables, … )
Brainpower + analytical tools
Track down data aberrations, issues, etc.
Election data file guide to clean up
26
What is Regression?
1. Method of modeling relationships between a response variable Y and one or more predictors X. (also known as dependent/endogenous variable Y and independent/exogenous variables X)
2. A way of “fitting a line (or curve) through data”
27
ISO 3534-3 terminology standard
3.3 regression analysis
collection of procedures associated with assessing models relating predictor variables to response variables
NOTE 1 Regression analysis is commonly associated with the process of estimating the parameters of an assumed model by optimizing the value of an objective function (for example, minimizing the sum of squared differences between the observed responses and those predicted by the model). The existence of statistical software packages has eliminated much of the drudgery in obtaining parameter estimates, their standard errors, and contain a wealth of model diagnostics.
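The objective function named in the note, the sum of squared differences between observed and predicted responses, can be written out directly for a straight-line model. The sketch below is a minimal illustration, not JMP's implementation: the function names and the five data points are made up for the example. It uses the usual closed-form least squares estimates for one predictor and checks that nudging the fitted slope makes the objective worse.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Objective function: sum of (observed - predicted)^2 for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, invented for this illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)
nudged = sum_squared_errors(xs, ys, b0, b1 + 0.1)

print("intercept:", b0, "slope:", b1)
print("SSE at the fit:", best)
print("SSE with slope nudged by 0.1:", nudged)
```

Any other choice of intercept and slope gives a sum of squares at least as large as the value at the fit, which is what "minimizing" means here.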
28
See New Work Item on Predictive Analytics to get a sense of Direction
Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Massive data sets:
• automated data collection associated with remote sensing
• transactional (on-line) purchases
• web site browsing and viewing patterns
• social media (networks and interactions)
Goal: Extract useful information (= actionable items)
Challenges: 1000s of explanatory variables, unstructured data
Opportunities: sufficient data for validating models
Getting started: core statistical methodologies (e.g., regression analysis) remain highly relevant, although the usual emphasis on inference and hypothesis testing gives way to estimation and prediction. Massive data sets have forced practitioners to rethink methodologies: assess the strengths and limitations, extend where possible, or develop new techniques to take advantage of or to cope with the data size magnitudes.
29
Why regression analysis? Basic Objectives please.
• Relationship between variables (response variable with explanatory variables; dependent with independent variables)
• Prediction (BIG DATA driver)
• Identify key variables of interest
• Clean up data set (data preparation)
• Test scientific hypotheses (not direct jump to validate prior hypotheses)
30
Regression—Two Extreme “Schools” of Thought Bracket the Area
• Don’t try this at home, I am a professional:
– Tell me what you have done and I will gleefully point out all of the errors, misunderstandings, and so forth. Only experts should be permitted to apply regression techniques, let alone use sophisticated software such as JMP Pro 11. Otherwise, only junk will be produced.
• Give it a try, what can possibly go wrong:
– One will always learn something from a detailed regression analysis. Thank goodness for JMP Pro 11 to eliminate the drudgery in the computations. Try your best and you can always ask for forgiveness before re-running your analysis with a friendly expert’s help.
31
Why “Regression?”
Regrettable term to some extent! Common usage…
Galton, late 1800s: Average height of sons, at a given father’s height, tends to “regress” toward the mean of the population (mediocrity)
[Figure: Sons’ heights plotted against Fathers’ heights (both axes 5.0 to 6.5), with the Pop Average, the line Y=X, and the Regression line marked]
32
Regression to Mean Gets “Rediscovered” Periodically
• Note story from Jordan Ellenberg’s excellent little book: How Not to Be Wrong: The Power of Mathematical Thinking, Penguin Press 2014.
33
34
More Galton Stuff
• Galton devoted much of his life to the study of variation in human populations, and it was during his studies about heredity (the passing of traits from parents to their offspring) that he introduced the concept of regression. However, he did not use this term as statisticians do now (when referring to the fitting of linear relationships); instead he was referring to a very specific statistical phenomenon known as regression to the mean.
• Investigating the relationship between the heights of parents and their children, Galton plotted the heights of 930 children who had reached adulthood against the mean height of their parents. To account for differences due to gender he increased female heights by a factor of 1.08.
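Galton’s adjustment can be written out in a few lines. This is a minimal sketch: the function names and the example heights (in inches) are hypothetical, and only the factor 1.08 comes from the slide. Averaging the two parents after scaling the mother’s height is one reading of “mean height of their parents”.

```python
FEMALE_FACTOR = 1.08  # multiplier applied to female heights (from the slide)

def adjusted_height(height, is_female):
    """Put a height on a common scale by scaling female heights by 1.08."""
    return height * FEMALE_FACTOR if is_female else height

def midparent_height(father, mother):
    """Mean height of the two parents after adjusting the mother's height."""
    return (adjusted_height(father, False) + adjusted_height(mother, True)) / 2.0

# Hypothetical family, heights in inches (not Galton's data).
print("adjusted mother:", adjusted_height(64.0, True))
print("midparent height:", midparent_height(70.0, 64.0))
```

The same `adjusted_height` step would be applied to daughters before plotting children against parents.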
35
• “It appeared from these experiments that the offspring did not tend to resemble their parents in size, but always to be more mediocre than they – to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were small.”
36
Horace Secrist
• Prof. of stat at Northwestern, Dir. Bureau for Business Research
• From 1920, collected and compiled massive data on businesses to determine who fails/wins
• In 1933: The Triumph of Mediocrity in Business
– Extremes (good or bad) headed for the middle
• So the Great Depression was like inevitable?
37
Another interpretation
• Height affected by many things (genetics, nutrition, lots of little random, lucky things)
• Same for businesses…extremes somewhat lucky and even though they had superior methods/practices, others could have luck in their favor as well (random fluctuations in time)
• Think of several people flipping coins, ask the most heads and least heads to flip again…
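The coin-flipping thought experiment is easy to simulate. The sketch below is one possible setup, with all numbers assumed for the illustration (20 people, 20 flips each, 2000 repetitions, a fixed seed). In each repetition it records the highest and lowest head counts, then lets those two people flip again. Since coin flips carry no skill, their second scores are simply fresh draws.

```python
import random

def count_heads(rng, flips):
    """Number of heads in `flips` tosses of a fair coin."""
    return sum(1 for _ in range(flips) if rng.random() < 0.5)

def coin_experiment(people=20, flips=20, trials=2000, seed=1):
    """Average first-round and second-round scores of the extreme flippers."""
    rng = random.Random(seed)
    totals = {"top_first": 0, "top_again": 0, "bottom_first": 0, "bottom_again": 0}
    for _ in range(trials):
        first_round = [count_heads(rng, flips) for _ in range(people)]
        totals["top_first"] += max(first_round)
        totals["bottom_first"] += min(first_round)
        # The most-heads and least-heads people flip again: pure luck both times.
        totals["top_again"] += count_heads(rng, flips)
        totals["bottom_again"] += count_heads(rng, flips)
    return {name: total / trials for name, total in totals.items()}

averages = coin_experiment()
for name, value in averages.items():
    print(name, round(value, 2))
```

The first-round extremes sit well away from 10 heads, while the re-flips average close to 10. Both groups move toward the middle with no cause other than chance, which is the point made against Secrist.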
38
Hotelling enters the fray…
• Alg. topologist, bright guy (monopoly in his head)
• His take on Secrist: “The labor of compilation and of direct collection of data must have been gigantic”; however, all of these tables and graphs merely “prove nothing more than that the ratios in question have a tendency to wander about” + “mathematically obvious from general considerations and does not need the vast accumulation of data adduced to prove it”. Results should work backward in time, but they don’t
• In other words, Secrist had wasted ten years of his life
39
• Also, published in JASA, but Secrist really didn’t get it
• So Hotelling stops being Mr. Nice Guy:
• “The thesis of the book when correctly interpreted is essentially trivial….
• To ‘prove’ such a mathematical result by a costly and prolonged numerical study of many kinds of business profit and expense ratios is analogous to proving the multiplication table by arranging elephants in rows and columns, and doing the same for numerous other kinds of animals.
• The performance, though perhaps entertaining, and having a certain pedagogical value, is not an important contribution either to zoology or mathematics”
40
Brownlee stack loss data
• Once upon a time the profession focused its attention on a little data set of 21 observations collected before color televisions became available for homes!
41
Example 1: Store Site Selection
Model sales Y at existing sites as a function of demographic variables:
X1 = Population in store vicinity
X2 = Income in area
X3 = Age of houses in area
X4 = Unemployment rate
X5 = Traffic data
From equation, predict sales at new sites
42
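The store-site idea (fit an equation at existing sites, then plug in a new site) can be sketched without any statistics package. Everything below is assumed for the illustration: only two of the predictors are used, the six "existing sites" are invented, and sales are generated from a known equation with no noise so the fit can be checked by eye. The coefficients are obtained by solving the normal equations X'X b = X'y, which is one standard route to least squares and not necessarily what JMP does internally.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, ys):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Predicted response at a new site from the fitted equation."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))

# Hypothetical existing sites: (population in thousands, income in thousands).
sites = [(10.0, 40.0), (20.0, 35.0), (15.0, 50.0),
         (30.0, 45.0), (25.0, 60.0), (12.0, 55.0)]
# Made-up sales from a known equation, so the fit should recover 50, 2 and 3.
sales = [50.0 + 2.0 * pop + 3.0 * inc for pop, inc in sites]

coefs = fit_ols(sites, sales)
new_site = (18.0, 48.0)
print("coefficients:", coefs)
print("predicted sales at new site:", predict(coefs, new_site))
```

With real data the responses would not fall exactly on a plane, and the fitted coefficients would come with standard errors and diagnostics, which is where a package such as JMP earns its keep.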
Example 2: Marketing Research
Model consumer response to a product on basis of product characteristics:
Y = Taste score on soft drink
X1 = Sugar level
X2 = Carbonation level
X3 = Ice/No Ice
X4 = _______
43
Example 2: Marketing Research
Model consumer response to a product on basis of product characteristics:
Y = Taste score on soft drink
X1 = Sugar level
X2 = Carbonation level
X3 = Ice/No Ice
X4 = sugar content
X5 = cost
44
Example 3: Arson Forensics
Understand burning processes for baselines of arson investigations
Y’s = time until flame out, max temp. in room, time until max temp. reached, average temp. in room overall and at individual locations, depth of char and bubble size throughout the room
X1 = Fuel type: Gasoline or Kerosene mix
X2 = Ignitable fuel amount
X3 = Ignitable fuel placement
X4 = Additional materials on sofa
X5 = Window 1 openness
X6 = Window 2
45
Example 4: Real Estate Pricing
Y = Selling price of houses
X1 = Square footage
X2 = Taxes
X3 = Lot acreage
X4 = Houses in area foreclosed
X5 = Rating of neighborhood school
X6 = Distance/time to downtown
46
One-Predictor Regression (Chris Nachtsheim example)
533 Homes Sold in Minnetonka, MN 2001
47
One-Predictor Regression
[Figure: Regression Plot of Price (0 to 1,000,000) against SqFt (500 to 4500) with the fitted line]
Price = -1957.83 + 158.950 SqFt
S = 79122.9 R-Sq = 67.2 % R-Sq(adj) = 67.1 %
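The fitted equation can be used directly for prediction. The short sketch below takes the intercept, slope, S and R-Sq from the output above; the function name and the 2000 square foot example are assumptions added for illustration.

```python
import math

INTERCEPT = -1957.83   # from the fitted equation
SLOPE = 158.950        # dollars per square foot, from the fitted equation
S = 79122.9            # residual standard deviation reported with the fit
R_SQUARED = 0.672      # R-Sq = 67.2 %

def predicted_price(sqft):
    """Predicted selling price from the one-predictor fit."""
    return INTERCEPT + SLOPE * sqft

price_2000 = predicted_price(2000)
# With one predictor, |correlation| is the square root of R-Sq;
# the sign matches the slope, which is positive here.
correlation = math.sqrt(R_SQUARED)

print("predicted price at 2000 sq ft:", round(price_2000, 2))  # 315942.17
print("typical size of a residual (S):", S)
print("correlation between Price and SqFt:", round(correlation, 3))
```

The intercept says a zero square foot home would sell for about -$1,958, which is a reminder that the line describes only the range of the data (roughly 500 to 4500 square feet) and should not be extrapolated beyond it.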
48
Orlando Housing Market
• Bad news on housing diminishing, supposedly better the past year or so
• Zillow site for recent sales
• Last year, looked at previous 30 days, sold price, square footage, #bedrooms, baths, taxes
50
51
Little Demo in JMP Pro 11 with this data
• Getting data into a data table (steps skipped to go from Zillow to JMP)
• Looking at the data
• Sales price as a function of variables
• Multivariate
• Predictive model?
• Model for understanding?
• Extent of generalizing possible….
52
Demo…
• Pull data into JMP
• Regress (v.) Price on sqft
• Check out fit(s) (with/without various data points; formula function)
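Checking a fit with and without particular data points can be sketched outside JMP as well. The example below is hypothetical throughout: the six houses are invented, with five falling exactly on a line of $100 per square foot and one large, cheap house acting as the unusual point. It refits the line after dropping that point and compares the slopes.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical sales: the last house is large but sold very cheaply.
sqft = [1500, 1800, 2000, 2200, 2500, 4000]
price = [150000, 180000, 200000, 220000, 250000, 150000]

all_fit = fit_line(sqft, price)
trimmed_fit = fit_line(sqft[:-1], price[:-1])  # refit without the last point

print("slope with all points:", all_fit[1])
print("slope without the last point:", trimmed_fit[1])
```

In this made-up case a single point is enough to flip the sign of the slope, which is the kind of "goofy" result worth tracking down before trusting a fit.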
53
Used ZILLOW for 2013 recent data
• ZILLOW not great for downloads…
• 65 observations
– 4+ bedrooms
– 3+ bathrooms
– non-missing lot size
– Orlando area
– “pool”
54